7 March 2024
Key takeaways
What business data points are you blind to? How much potential is left on the table because of the data you’re not storing? And what relationship, short and long term, do you want with the technology that can remove these blind spots – rely on a provider, hire a consultancy company, or staff up internally?
TL;DR: there is no single (easy) path – or we might be out of a job.
With the abundance of data sources – from ERPs and CRMs to social media and IoT devices – even start-ups and non-tech businesses understand the value of analytics and cloud services: retaining and acquiring customers by better understanding their audience, making internal processes more cost-efficient, and opening the door to automation across the organization.
However, these smaller – or less tech-savvy – organizations are understandably cautious about developing and maintaining workloads they are not specialized in. Many will rely on a consultancy company (like us!) to develop and maintain those workloads, and they can use that engagement to grow their team internally: actively participating in the development and taking on some, if not all, of the future ongoing maintenance and development.
As data analytics specialists, with every new client we delve deep into their business to help build a data analytics roadmap that delivers in the short term but is also aligned with a long-term, organization-wide vision – one that considers culture, competencies, goals, and commitments, and chooses the right processes, technology, and development path.
Development is aligned; what about the tools?
Some important points:
- Managed services can be a great way to start the journey for organizations that have little or no capacity to maintain workloads and don’t want to depend on outsourced services. But as the number of developers, the workload volume, and the number of end users grow, managed services become increasingly expensive – so choose them according to your organization’s roadmap.
- For a more customized workload – including purpose-built connectors for legacy systems, less common APIs, and client-specific automation tasks that require joint orchestration with core workloads – you may need to choose services, tools, or even entire architecture layers that increase the maintenance effort.
- Where are your sources, where are your users, and where do you want to see the data? Transferring data out of a cloud incurs a small fee, so it seems intuitive to keep source data, transformation processes, and visualization dashboards in the same cloud and region. However, that may reduce your choice of services and potentially result in higher transformation costs or lower adoption of the dashboards by their end users.
- The acceptable degree of cloud lock-in needs to be discussed. A truly cloud-agnostic workload may prove expensive and mostly unnecessary, but some architecture decisions can heavily reduce future migration effort – for example, choosing data analytics tools (extraction processes, business transformation logic, orchestration pipelines, etc.) that can be ported with less refactoring to another cloud (given a base infrastructure is set up there) or even to a third-party managed service. However, these cloud-agnostic tools tend to be more expensive or require more maintenance than their “cloud native” counterparts, making this a compromise between cloud lock-in, price, and maintenance.
- The degree of acceptable service vendor lock-in also needs discussion – the choice of service for extraction, transformation, orchestration, visualization, cataloguing, and so on determines the refactoring effort required later to replace it. What can you do? Choose open-source technologies, or technologies whose development language translates more easily to other tools (such as SQL and Spark). Most importantly, keep your architecture as modular as possible to limit the impact area when replacing any component – see the sketch after this list.
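To make that modularity point concrete, here is a minimal Python sketch (the class and function names are illustrative, not from any particular library): each pipeline stage is written against a small interface, so swapping a vendor means writing a new adapter rather than touching the rest of the pipeline.

```python
# A minimal sketch of a modular pipeline: components depend on interfaces,
# not vendors, so any one of them can be replaced in isolation.
import csv
from typing import Iterable, Protocol


class Extractor(Protocol):
    """Any source connector: a managed service, a custom API client, etc."""
    def extract(self) -> Iterable[dict]: ...


class Warehouse(Protocol):
    """Any target warehouse client: Snowflake, Redshift, BigQuery, ..."""
    def load(self, table: str, rows: Iterable[dict]) -> None: ...


def run_pipeline(extractor: Extractor, warehouse: Warehouse, table: str) -> None:
    # The pipeline only knows the interfaces, so replacing one component
    # limits the impact area to its adapter.
    warehouse.load(table, extractor.extract())


class CsvExtractor:
    """Toy stand-in for a real connector."""
    def __init__(self, path: str) -> None:
        self.path = path

    def extract(self) -> Iterable[dict]:
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)


class PrintWarehouse:
    """Toy stand-in for a real warehouse client."""
    def load(self, table: str, rows: Iterable[dict]) -> None:
        for row in rows:
            print(table, row)


if __name__ == "__main__":
    # "orders.csv" is a placeholder file name.
    run_pipeline(CsvExtractor("orders.csv"), PrintWarehouse(), "raw.orders")
```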
What might a managed services data analytics architecture look like initially?
Starting with the data warehouse – there is healthy competition on the market. Snowflake, Amazon Redshift, and Google BigQuery are popular services offering different features and pricing options, so it’s critical to understand which features you need and determine your usage pattern to assess the right technology.
DBT is a good choice for creating and maintaining your business models. It’s a framework for building SQL models, with features such as data testing, data lineage, documentation, version control, and abstraction layers that keep your project modular and flexible. A well-developed DBT project also makes it easier to replace the data warehouse technology in the future, as it considerably lowers the rework effort.
DBT is an open-source project; however, there is also a managed cloud offering (dbt Cloud) with competitive pricing for a small development team – all without worrying about hosting!
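As a flavour of the development loop, here is a minimal sketch using DBT’s programmatic invocation API (available in dbt-core 1.5+). The `+staging` selector is an illustrative placeholder, not a required convention:

```python
# Requires: pip install dbt-core plus a warehouse adapter (e.g. dbt-snowflake),
# and must be run from inside a DBT project with a configured profile.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Build the (hypothetical) "staging" models and everything they depend on.
run_result: dbtRunnerResult = dbt.invoke(["run", "--select", "+staging"])

# Then execute the data tests declared alongside those models.
test_result: dbtRunnerResult = dbt.invoke(["test", "--select", "+staging"])

if not (run_result.success and test_result.success):
    raise SystemExit("DBT run or tests failed - check the logs above.")
```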
If you’re using source data from another team or department, and it’s already in storage that your data warehouse can reference (such as Amazon RDS, Amazon S3, Azure Storage Accounts, or Google Cloud Storage), then congratulations: your backend data architecture may start with just these two tools.
However, most projects will require extracting new data. In this case, you can use a managed service like Fivetran or Stitch, which hosts and maintains the connectors; you choose the source and the target, and pay per use.
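To show how little code these services leave on your side, here is a sketch that triggers a connector sync through Fivetran’s REST API. The credentials and connector ID are placeholders, and you should verify the endpoint against the current Fivetran API reference before relying on it:

```python
# A sketch: asking Fivetran to sync one connector via its REST API.
# Requires: pip install requests. All identifiers below are placeholders.
import requests

API_KEY = "your-api-key"        # generated in the Fivetran dashboard
API_SECRET = "your-api-secret"
CONNECTOR_ID = "my_connector"   # the connector whose sync you want to trigger

response = requests.post(
    f"https://api.fivetran.com/v1/connectors/{CONNECTOR_ID}/sync",
    auth=(API_KEY, API_SECRET),  # Fivetran uses HTTP basic auth (key:secret)
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the sync itself runs on Fivetran's side
```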
If you start facing requests such as “I want specific models to be refreshed right after certain data is extracted and loaded”, “I want to add custom sources that my extraction tool does not support”, or “I need some custom automated steps to connect a few processes during the ELT process”, then you might need to add another layer to your ecosystem: a flexible orchestrator that can oversee the entire process and fill the remaining gaps, such as Astronomer – essentially a managed version of Apache Airflow.
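To make the first of those requests concrete, here is a minimal Apache Airflow DAG sketch: an extraction task followed immediately by a DBT run over the affected models. The commands, file paths, and the `+orders` selector are illustrative placeholders:

```python
# A minimal Airflow DAG sketch (Airflow 2.4+): refresh specific DBT models
# right after their source data is extracted. Names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_elt",
    start_date=datetime(2024, 3, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Step 1: pull new data from the source (this could equally trigger a
    # managed-service sync, as sketched earlier).
    extract_orders = BashOperator(
        task_id="extract_orders",
        bash_command="python /opt/pipelines/extract_orders.py",
    )

    # Step 2: rebuild only the models that depend on the fresh data.
    refresh_models = BashOperator(
        task_id="refresh_order_models",
        bash_command="dbt run --select +orders",
    )

    # The dependency arrow enforces "refresh right after extract-and-load".
    extract_orders >> refresh_models
```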
Many other managed tools on the market – for cataloguing, testing, monitoring, and other use cases – can be added to this architecture.
The more applications you add, the more challenging it becomes to manage your developments, integrations, and deployments, as each tool may have its own approach.
If the following sentence ever makes sense to you – “My business feature requests are growing, and so are my developer team and the workload size, pushing me into higher tiers of the managed services, which are becoming too expensive – I want to take on some of the maintenance internally and customize my architecture to my exact needs” – then the following architecture aims to address that:
The first objective is to set up a good base infrastructure in any of the clouds – one that can scale, is widely adopted, allows hosting any application, and uses resources efficiently. Kubernetes is the standard that fulfils these requirements (it’s how most managed services work in the backend): it will host your applications, share cloud resources between them for efficiency, and handle communication among applications and with the outside world.
You don’t need to manage “pure Kubernetes”, though: each cloud has its own managed Kubernetes service (such as Amazon EKS, Azure AKS, or Google GKE), so that part of the maintenance is assured by the cloud provider – like every managed service, for a small fee – and you can run your cloud-agnostic workloads inside it: a nice compromise between customization, price, features, and maintenance.
Once the base infrastructure is set up, you can install the applications you need – Apache Airflow and DBT Core, among others – and refactor the previous workloads from the managed services into the new architecture. If the managed services were chosen with a roadmap like this in mind, the refactoring of business rules and process orchestration will be heavily reduced.
Some benefits of this architecture are the pricing at larger scale, a more centralized CI/CD pipeline, and the ease of creating multiple environments quickly and dynamically as needed – such as separate dev environments (one per development or developer) and proper pre-prod environments for better testing and UAT; a sketch of that environment spin-up follows.
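As an illustration of that dynamic environment creation, here is a sketch using the official Kubernetes Python client to spin up an isolated namespace per developer; the naming scheme and labels are assumptions for the example:

```python
# A sketch: one isolated Kubernetes namespace per developer, into which a
# CI/CD pipeline could then deploy Airflow, DBT Core, and the rest.
# Requires: pip install kubernetes, and a kubeconfig pointing at the cluster.
from kubernetes import client, config


def create_dev_environment(developer: str) -> None:
    config.load_kube_config()  # authenticates using your local kubeconfig
    api = client.CoreV1Api()
    namespace = client.V1Namespace(
        metadata=client.V1ObjectMeta(
            name=f"dev-{developer}",                 # illustrative naming scheme
            labels={"env": "dev", "owner": developer},
        )
    )
    api.create_namespace(namespace)
    print(f"Namespace dev-{developer} created and ready for deployments.")


if __name__ == "__main__":
    create_dev_environment("alice")  # placeholder developer name
```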
Before committing to an architecture, many more considerations must be weighed, as each client will have unique requirements and blockers.
However, one common theme is that business teams want new features delivered ever faster. The market for data analytics is evolving rapidly, and new and better offerings are continuously being released, so being adaptable in this marketplace matters more than ever.