‘We fear the calm more than the storm’
~Marty Rubin
I am amazed at how aptly this quote describes today's data world. With the plethora of tools and technologies available for building and orchestrating the different parts of data pipelines, rave reviews often drive clients to adopt what they want instead of what they actually need.
These tools get incorporated into pipeline solutions just because they are familiar, without much thought for the complexity they add to your solution.
In this blog, let's see how we can create a full-fledged, standalone data pipeline in Snowflake.
The idea is to leverage Snowflake streams, stored procedures, and tasks to build the pipeline.
Use case:
Different sources load data into Snowflake at some frequency.
End goal:
Perform transformations on these tables to get the final fact tables.
Snowflake Objects needed:
We can leverage just four Snowflake objects to automate the whole process:
1. Table
2. Stored procedure
3. Stream
4. Task
Solution Architecture Diagram:
Solution breakdown:
1. Tables →
Transformation table:
- sequence_no
- sql_statement
Orchestration table:
- sequence_no
- transformation_table_name
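A minimal sketch of these two control tables; the database, schema, and transformation table name (tf_customer_fact) are assumptions, while config_db.control.orch_tbl1 matches the orchestration table used in the task example later in this post:

-- Hypothetical schema; adjust names to your environment.
CREATE TABLE config_db.control.tf_customer_fact (
    sequence_no   NUMBER,
    sql_statement STRING
);

CREATE TABLE config_db.control.orch_tbl1 (
    sequence_no               NUMBER,
    transformation_table_name STRING
);

-- Each row of a transformation table holds one SQL step, run in order.
INSERT INTO config_db.control.tf_customer_fact VALUES
    (1, 'TRUNCATE TABLE analytics.marts.customer_fact'),
    (2, 'INSERT INTO analytics.marts.customer_fact SELECT ... FROM raw_db.landing.customers');

-- The orchestration table lists which transformation tables to run, in order.
INSERT INTO config_db.control.orch_tbl1 VALUES
    (1, 'config_db.control.tf_customer_fact');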
2. Stored procedures →
Create a stored procedure, say orchestrate(orch_tbl string), which takes the orchestration table's fully qualified name as its argument (orch_tbl).
Pseudo code:
- Query orch_tbl and iterate over the transformation_table_name values in sequence order.
- For each transformation table, iterate over its sql_statement column and execute the transformations (see the sketch below).
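A minimal sketch of this procedure in JavaScript (one of Snowflake's stored procedure languages); the column names follow the table definitions above, and the procedure assumes the control tables hold trusted SQL, since the statements are built by concatenation:

CREATE OR REPLACE PROCEDURE orchestrate(ORCH_TBL STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // Read the transformation tables to run, in sequence order.
  var orchRs = snowflake.createStatement({
    sqlText: "SELECT transformation_table_name FROM " + ORCH_TBL +
             " ORDER BY sequence_no"
  }).execute();

  var executed = 0;
  while (orchRs.next()) {
    var tfTbl = orchRs.getColumnValue(1);

    // Read this table's SQL steps, again in sequence order.
    var tfRs = snowflake.createStatement({
      sqlText: "SELECT sql_statement FROM " + tfTbl + " ORDER BY sequence_no"
    }).execute();

    // Execute each transformation statement.
    while (tfRs.next()) {
      snowflake.createStatement({ sqlText: tfRs.getColumnValue(1) }).execute();
      executed++;
    }
  }
  return "Executed " + executed + " statement(s).";
$$;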
3. Streams →
A stream can be created over a table to capture new changes (records inserted, updated, or deleted). This in turn can be leveraged to decide whether to run a task or to skip that scheduled run.
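First, a minimal stream sketch; the landing table name is an assumption, while CHANGE_STREAM matches the stream the task below checks:

-- Hypothetical landing table that upstream sources load into.
CREATE OR REPLACE STREAM CHANGE_STREAM
  ON TABLE raw_db.landing.source_table;

The task can then check SYSTEM$STREAM_HAS_DATA to run only when the stream has captured changes. Example: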
CREATE TASK orchestration_1
  WAREHOUSE = transformation_wh
  SCHEDULE = '360 MINUTE'  -- evaluate every 6 hours
WHEN
  -- Skip this run entirely if the stream has captured no changes.
  SYSTEM$STREAM_HAS_DATA('CHANGE_STREAM')
AS
  CALL orchestrate('config_db.control.orch_tbl1');
4. Tasks →
For each of your orchestrations, create a task like the one above and keep its entry in a data_pipelines table, along with the cron schedule on which it runs. Note that newly created tasks are suspended and must be resumed before they run, as shown below.
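A minimal sketch, reusing the task name from the example above:

-- Tasks are created in a suspended state; resume to start the schedule.
ALTER TASK orchestration_1 RESUME;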
And you are ready with your standalone Snowflake data pipeline.
Conclusion →
Curated minimalism in choosing tools and technologies, and in designing your solution, can save you a lot of money, effort, and time. So choose wisely.
Thinking of moving to the Cloud?
Let us help you on your cloud journey. Reach out to us at Cluephant. Achieve FASTER, BETTER results with Cluephant (Cloud Natives).