Introduction:
Dear Network, I am thrilled to publish this article today. The article is about dbt (data build tool). dbt is an open-source Python package, known as an Analytics Engineering workflow management tool. But I would like to give my own perspective on dbt: to me, dbt is a platform-agnostic Data Model Management Framework.
I have worked on several data analytics projects and have used many ways to manage them, such as metadata-driven approaches, Python object-oriented programming, Python task decorators, etc. Out of all of them, I think dbt is the best way to go forward. It addresses most of the challenges in an analytics project. Every data warehouse, lakehouse, big data, data engineering, machine learning and data science team should use it. I am sure you will agree after using it.
Data Model Management Framework,
Data – Could be on any leading data platform, such as Snowflake, Databricks, Fabric, BigQuery or Redshift. For each platform there is a separate dbt adapter available. Bingo, dbt works the same way irrespective of the underlying data platform.
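For example, installing the adapter pulls in the right connector for your platform (the package names below are the commonly published adapters; verify the one for your platform on PyPI):

pip install dbt-snowflake     # Snowflake
pip install dbt-databricks    # Databricks
pip install dbt-bigquery      # BigQuery
pip install dbt-redshift      # Redshift
pip install dbt-fabric        # Microsoft Fabric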
Model – Model is an interesting concept to learn and understand. A model is code (SQL or Python) that primarily results in one single dataset, which can be materialized as a Table / View / Temp View / Logical View / other data objects as well (with the help of macros). But it is strictly recommended that a model not be a procedure / function that produces multiple dataset outputs.
Each model (.sql/.py code file) has one data object created with the same name as the model file. A model (.sql/.py code file) has a 1-to-1 relation with its data object. This is going to help you a lot in model management.
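For instance, a hypothetical file models/orders.sql like the sketch below would result in exactly one data object named orders; the source table and column names are assumptions for illustration:

-- models/orders.sql
-- One model file = one data object: dbt creates an object called "orders"
select
    order_id,
    customer_id,
    order_date,
    amount
from raw_db.raw_schema.raw_orders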
In data engineering, datasets are the most created objects (like variables 😊), but the least documented, tested and cared for in terms of lineage. With dbt, the model is the heart of your application. All your datasets are given the highest priority in terms of code visibility/sharing, documentation, tests, lineage, security, etc.
dbt maintains a lineage graph based on the model dependencies. Models are nodes in the lineage graph, and the collection of nodes with dependencies and a flow direction is a Directed Acyclic Graph (DAG).
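The dependencies come from the ref() function: when a model selects from {{ ref('orders') }}, dbt draws an edge from orders to that model in the DAG. A minimal sketch, with hypothetical model and column names:

-- models/customer_orders.sql
-- ref('orders') makes orders an upstream node of customer_orders in the lineage graph
select
    customer_id,
    count(*) as order_count
from {{ ref('orders') }}
group by customer_id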
Management – With simple dbt commands you should easily be able to define, create, modify, delete, test, document, secure and deploy data objects on the data platform of your choice. Structure your code uniformly across the team, create cross-functional teams, share and reuse your code, keep everything as readable, version-managed code, document and host it for end users to browse the data catalog easily, show lineage and much more.
Core management capabilities,
- Snapshot by dbt – Slowly changing dimension with simple configurations
- Execute pre and post execution scripts (with hooks)
- Add external objects to the lineage graph (exposures) – a sample definition is sketched after this list
- Add and maintain reference data (seed files)
- Group your nodes to simplify the view of your DAG (groups)
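As an example of the exposure capability mentioned above, a hypothetical dashboard can be registered in any .yml file under models so it appears as a downstream node in the DAG; all names and the URL below are assumptions:

# models/exposures.yml
version: 2

exposures:
  - name: weekly_orders_dashboard
    type: dashboard
    maturity: medium
    url: https://bi.example.com/weekly_orders
    description: Weekly orders dashboard built on top of dbt models
    depends_on:
      - ref('orders')
    owner:
      name: Analytics Team
      email: analytics@example.com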
Framework – A uniform way of working across the team. Start your analytics project with standards: quick understanding of the application models, relatively quick and easy change requests, reusable and sharable code, tested, documented, secured and, overall, a happy data application life cycle.
Software Installation and Setup
- Install Python – For Snowpark it is mandatory to use Python < 3.11, and this setup uses Python 3.10. The attached link goes to Python 3.10.11
- Install Git
- Install VS Code
- Enable the extensions dbt Power User and Snowflake for VS Code – you need these at every step of your dbt project life cycle
To associate .sql files between the dbt and Snowflake extensions, search for File Associations in the VS Code settings. Keep jinja-sql for *.sql files and snowflake-sql for *.sfsql files. Well, change this as you like.
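If you prefer to put this straight into settings.json, the equivalent entry looks roughly like the sketch below (adjust the language IDs to whatever your installed extensions register):

// .vscode/settings.json (or your user settings)
{
  "files.associations": {
    "*.sql": "jinja-sql",
    "*.sfsql": "snowflake-sql"
  }
}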
Create virtual environment and dbt installation
# Create virtual environment
C:\Your_Folder_of_Choice> python -m venv dbt_env

# If you have multiple Python versions installed, call the specific interpreter
C:\Users\XXX\AppData\Local\Programs\Python\Python310> python.exe -m venv dbt_env

# Activate virtual environment
C:\Your_Folder_of_Choice> .\dbt_env\Scripts\Activate.ps1

# Upgrade setuptools
(dbt_env) C:\Your_Folder_of_Choice> pip install --upgrade setuptools

# dbt installation
(dbt_env) C:\Your_Folder_of_Choice> pip install dbt-snowflake
dbt initialization
(dbt_env) C:\Your_Folder_of_Choice> dbt init

Enter a name for your project (letters, digits, underscore): <your choice>
Which database would you like to use?
[1] snowflake

(Don't see the one you want? https://docs.getdbt.com/docs/available-adapters)

Enter a number: 1
account (https://<this_value>.snowflakecomputing.com): your account
user (dev username):
[1] password
[2] keypair
[3] sso
Desired authentication type option (enter a number): <select your choice>
role (dev role): <your choice>
warehouse (warehouse name): <your choice>
database (default database that dbt will build objects in): <your choice>
schema (default schema that dbt will build objects in): <your choice>
threads (1 or more) [1]: <your choice>   (Need to check the impact)

<time_stamp>  Profile test2 written to C:\Users\XXX\.dbt\profiles.yml using target's profile_template.yml and your supplied values. Run 'dbt debug' to validate the connection.
<time_stamp>  Your new dbt project "Your_choice" was created!

For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:

  https://docs.getdbt.com/docs/configure-your-profile

One more thing:

Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:

  https://community.getdbt.com/

Happy modeling!
dbt Folder Structure
At the end of the dbt initialization, you should see the folder structure below:
- Analyses – To keep all your ad hoc analysis queries
- Macros – To customize your model deployment, e.g. to change the schema or create other objects
- Models – All your code goes into Models. As explained earlier, one model file = one data object (Dataset / Dataframe / Table / View / Logical View / Temp View / Other Object). You can organize your code as per your need. Start by keeping folder names = schema names; later you can change this depending on your need.
- Seeds – Reference data .csv files. These files will be uploaded to the data platform simply with the command dbt seed
- Snapshots – This is a powerful feature. With simple configuration, you will be able to implement slowly changing dimensions
- tests – Place your test block code here.
dbt Key Files
profiles.yml – The very first important file to be aware of. It captures all your database connection settings per project. This file is under Users\xxx\.dbt. Any dbt command, such as dbt debug or dbt compile, is executed against your target environment. To have more environments, edit the yml file to add more connections. Keep the default target as target: dev. Best practice is to select the environment at the time of running dbt commands, such as,
dbt debug --target <other environment name>
On the same account, you can change the database and schema through other dbt configurations.
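A sketch of what a profiles.yml with a dev and a prod output could look like; the account, role, warehouse, database and schema names below are assumptions, and the password is read from an environment variable rather than stored in the file:

# C:\Users\XXX\.dbt\profiles.yml
your_project:
  target: dev                      # default target stays dev
  outputs:
    dev:
      type: snowflake
      account: your_account
      user: dev_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: dev_role
      warehouse: dev_wh
      database: dev_db
      schema: dbt_dev
      threads: 1
    prod:
      type: snowflake
      account: your_account
      user: prod_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: prod_role
      warehouse: prod_wh
      database: prod_db
      schema: dbt_prod
      threads: 4

With this in place, dbt debug --target prod or dbt run --target prod picks the prod connection.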
dbt_project.yml – The next key file, used to manage your folder paths, models, tables, schemas, etc. This file is included within your project folder. Most of the control of a dbt project comes from this file. You can see how the folders are recognized by dbt. The recommendation is not to change the defaults.
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
You need to use macros to deploy to a custom schema; a sketch of one common override follows the configuration below.
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  project_x:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view
    orders:
      +materialized: view
      +schema: orders
    customers:
      # conditional database selection
      +database: "{{ 'prod_db' if target.name == 'prod' else 'dev_db' }}"
      +schema: customers
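For the +schema configs above to land in exactly the schema you name (by default dbt appends the custom schema to the target schema), one common approach is to override the built-in generate_schema_name macro. A minimal sketch:

-- macros/generate_schema_name.sql
-- Use the custom schema as-is when one is configured; otherwise fall back to the target schema
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}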
schema.yml – You can have this file in each subfolder within models. This file gives you more control over specific models.
Configs – Check all the sub-menus under configs. You can choose how to apply them (at the model folder level or at the model file level)
Properties – Check all the sub-menus under properties. You can choose how to apply them (at the model folder level or at the model file level)
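Putting configs and properties together, a schema.yml for the starter model might look roughly like the sketch below; the grant role name is an assumption:

# models/example/schema.yml
version: 2

models:
  - name: my_first_dbt_model
    description: "A starter dbt model"
    config:
      materialized: table
      grants:
        select: ["reporting_role"]   # role name is an assumption
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null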
dbt in Action
dbt debug
# profiles.yml is found
# dbt_project.yml is found
# Connection details are displayed
# All checks passed!

dbt compile
# Compiles your models and creates sql files under target/
# These files would be executed against your database
Take a look at the models/example folder,
There are three files
- my_first_dbt_model.sql –> materialized as table as per custom config
- my_second_dbt_model.sql –> materialized as view as per dbt_project.yml
- schema.yml –> Consists of grants and column-level test configs.
dbt run
# Models are deployed to the database along with the necessary grants
When you do dbt run, both the models are deployed to the database. There are a lot of options to dbt run to check out. Along with object deployment, grants are also applied as per the schema.yml configuration.
dbt test
# Model test scripts are generated and executed on the database based on the test configuration
Based on schema.yml, unique and not-null tests are performed on the id column. my_first_dbt_model will fail due to uniqueness; this is intentional, for test demo purposes. There are a lot of ways you can perform tests.
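Besides the generic tests configured in schema.yml, you can also drop a singular test into the tests folder; it is simply a select that returns the offending rows, and dbt test fails if any rows come back. A sketch, assuming a hypothetical orders model with an amount column:

-- tests/assert_no_negative_amounts.sql
select *
from {{ ref('orders') }}
where amount < 0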
dbt seed
# All your reference data from the seeds folder is uploaded to the database
How many times have you struggled to load reference data into a database? Well, dbt seed to the rescue. Keep your reference data (csv files) in the seeds folder and execute the dbt seed command to load them to the database.
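For example, a small hypothetical lookup file seeds/country_codes.csv like the one below becomes a table after dbt seed, and you can reference it in any model with {{ ref('country_codes') }}:

country_code,country_name
IN,India
NL,Netherlands
US,United States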
dbt snapshot
# Based on the snapshot configuration, source table version history is captured and stored in the target table
A cool feature to implement SCD2 without writing any code. It is a bit of a struggle to develop and maintain this code across teams. Not anymore: dbt will take care of implementing SCD2 for you. You need to keep the right configuration for every source. A sample is shown below. When you hit the dbt snapshot command, a snapshot of orders is taken into the target table orders under the snapshots schema. This table will contain dbt_valid_from, dbt_valid_to, dbt_scd_id and dbt_updated_at. Read more details about snapshots.
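The sample referred to above, as a minimal sketch; the unique_key and updated_at columns are assumptions, and the source follows the jaffle_shop example used later in this article:

-- snapshots/orders.sql
{% snapshot orders %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('jaffle_shop', 'orders') }}

{% endsnapshot %}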
# dbt codegen, advanced feature to auto generate source and base models
dbt run-operation generate_source --args 'schema_name: raw_jaffle_shop'
Most analytics projects already consist of a lot of data sources. One of the first actions when you work with dbt is to define them. Well, you can place a <source_file_name>.yml within your models folder.
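A sketch of such a source file; the database, schema and table names are assumptions based on the jaffle_shop example used below:

# models/<source_file_name>.yml
version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: raw_jaffle_shop
    tables:
      - name: orders
      - name: customers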
Well, you don't want to do this manually. Install the dbt-codegen package to help you. Create a packages.yml file and list the package name as below,
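A minimal packages.yml sketch (the version number is only an example – pin whichever codegen release matches your dbt version on hub.getdbt.com):

# packages.yml (at the project root)
packages:
  - package: dbt-labs/codegen
    version: 0.12.1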
dbt deps

# After installation execute the command below
dbt run-operation generate_source --args 'schema_name: raw_jaffle_shop'
Install dbt packages with the command dbt deps. After installation, execute the dbt run-operation command above; this will show all the source tables on screen. Copy them and paste them into <source_file_name>.yml. There are a few more steps to create base models, which I will cover later (please comment if you are interested).
You can also reference and query the sources in your models. To simply query, use,
select top 10 * from {{ source('jaffle_shop','orders') }}

-- dbt Power User will dispatch the query below to the database
select top 10 * from database.jaffle_shop.orders
Final Silver Bullet
dbt docs generate

# Host documents with lineage
dbt docs serve
With two simple commands, dbt will generate documentation (along with lineage) and serve it from your local machine. A feature the whole analytics industry is desperate for. Lineage is a key feature to know how models are combined and collaborated to bring the end result. A lot of effort has been put in by multiple providers, with limited results. dbt lineage looks like a promising way forward.
Final Thoughts
I have shown only a few limited features of dbt; there is quite a lot more to explore. dbt is open source and Python-based, with a lot of community support. More and more members will join and more features will be added. I am sure dbt is the way-to-go solution for all data analytics projects.
Thanks to dbt Labs and all the contributors for bringing us this wonderful data analytics solution. dbt Labs also has a wonderful cloud solution with advanced features ready to use (without any installation). If you are interested, approach me for a free trial.
The best way to predict the future is to create it together. – Joe Echevarria
Please share this article with all your network. We want to create a better DataOps world by standardizing and simplifying data management processes, enhancing collaboration across teams, implementing efficient and scalable data infrastructure, and improving application code design and development practices.