12 April 2024
Key takeways
DBT is a transformation workflow. The acronym stands for data build tool.
As such, DBT handles the T from ELT, but actually, it’s not dbt that does the transformation. You will need to have a Postgres DB, SQL Server, Redshift, Databricks, Snowflake, or GCP (and more) so that dbt can send your transformation processes to be executed by the DB engine.
DBT takes care of orchestration of the transformation processes (models), sending them to be executed on the DB engine you configured. That means that dbt uses SQL (where the SQL flavour depends on the engine you connect it to).
DBT enables data analysts and data engineers to automate the testing and deployment of the data transformation process. This is especially useful because many companies have increasingly complex business logic behind their reporting data. The dbt tool keeps a record of all changes made to the underlying logic and makes it easy to trace data and update or fix the pipeline through version control.
Why DBT
DBT adds much more than just being a transformation process. Follows a quick list of what you gain when using dbt:
Work with materialisations and then adjust for what you need
In DBT, you work with models that can be materialised as tables, views, incremental or ephemeral. To change the materialization type, you don’t need to rewrite the DDL like you are used to when working with SQL; you just need to change the configuration for that mode, and DBT takes care of it the next time it runs.
Data lineage from the start
A model is nothing more than an SQL statement, but with dbt, you can use the ref keyword to refer to an upstream model. This allows you to determine the execution order of your models, execute your transformations in stages, and extract the complete lineage of your dbt projects.
Document your models
With dbt, your documentation is done where it belongs, close to your models, and is version-controlled. Since the project uses the same git repo for versioning, you and your fellow developers will have access to all the documentation you and your fellow developers created locally.
You can easily share it also by publishing it to a static website like the one below if you use DBT core (open source) or simply by running DBT on DBT cloud (paid subscription)
Reuse your code and use control structures in your code
SQL files in dbt can contain Jinja, a lightweight templating language. Using Jinja in SQL provides a way to use control structures in your queries. For example, if statements and for loops. It is very useful to write repetitive SQL code like this:
In the above, imagine you have to write an SQL with a big list of payment methods. With Jinja, you could just add them to the list, and the for loop would take care of it.
You can also create macros that you or your colleagues can reuse according to the needs, thus eliminating the need to repeat code throughout the project.
Test your models
Tests are tightly integrated into the dbt framework. You can use assertions on your code (generic or custom) to test the results generated by a model and act accordingly, either failing the execution or just warning about it.
Additionally, you can integrate automated testing packages into dbt (like dbt-expectations), allowing you to expand the scope of tests you can execute on your models.
Use external packages to help your development
DBT’s support for external packages in SQL and Python allows you to build more complex and powerful data pipelines, like using Python’s pandas’ library to make data transformations, for instance:
Options to use DBT
DBT Core
DBT Core is the open-source version of DBT. You can install it on your local machine using Homebrew (macOs or linux) or pip using the adapter you plan to use. For instance, DBT -Postgres. You can also install it using Docker or from the source git code.
With DBT Core, you manage your dbt infrastructure, versioning, and, depending on your use case, an orchestration platform to run and monitor your jobs on a schedule.
Refer to https://docs.getdbt.com/docs/core/installation for installation instructions.
DBT Cloud
DBT Cloud is a cloud-based service that provides additional functionality on top of DBT Core.
You won’t have to worry about infrastructure, software installation and managing your database connections/credentials. It provides code versioning + collaboration, and sharing options. You will have access to automated scheduling and orchestration and advanced analytics and reporting, which will allow you to track the performance of your data transformations and identify issues and bottlenecks.
It does come with a cost and may not be necessary for all use cases.
Refer to https://www.getdbt.com/pricing/ for pricing on this offer.