
I'm working for a startup and am setting up our analytics tech stack from scratch. As a result of limited resources, we're focusing on using third-party tools rather than building custom pipelines.

Our stack is as follows:

ELT tool: either Fivetran or Hevo
Data warehouse: BigQuery
Transformations: dbt cloud
Reverse ETL: Hightouch (if we go with Fivetran - Hevo has built-in reverse ETL)
BI Tool: Tableau

The problem I'm having is:

With either Fivetran or Hevo there's a break in the workflow below: we have to switch tools, and there's no integration within the tools themselves to trigger jobs sequentially based on the completion of the previous job.

Use case (workflow): load data into the warehouse -> transform using dbt -> reverse ETL the data back out of the warehouse into a tool like Mailchimp for marketing purposes (e.g. a list of user IDs who haven't performed certain actions and whom we therefore want to send a prompt email, a list produced by a dbt job that runs daily)

Here's how these workflows would look in the respective tools (E = Extract, L = Load, T = Transform):

Hevo: E+L (Hevo) -> break in workflow -> T: dbt job (can't be triggered from within the Hevo UI) -> break in workflow -> reverse E+L: can be done within the Hevo UI but can't be triggered by a dbt job

Fivetran: E+L (Fivetran) -> T: dbt job (can be triggered within the Fivetran UI) -> break in workflow -> reverse E+L: Fivetran partners with a company called Hightouch, but there's no way to trigger the Hightouch job based on the completion of the Fivetran/dbt job.

We can of course just sync these up on a time-based schedule, but that means if a previous job fails, subsequent jobs still run, incurring unnecessary cost. It would also be good to be able to re-trigger the whole workflow from the last break point once you've debugged it.

From reading online I think something like Apache Airflow could be used for this type of use case, but that's all I've got thus far.
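
To make that concrete, here's a rough sketch of how the workflow might look as an Airflow DAG (the task bodies are placeholders for the real Fivetran/dbt Cloud/Hightouch API calls or provider operators):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def trigger_fivetran_sync():
        # Placeholder: call Fivetran's API to start the connector sync
        # and poll until it completes.
        ...

    def run_dbt_cloud_job():
        # Placeholder: trigger the dbt Cloud job and wait for it to finish.
        ...

    def trigger_hightouch_sync():
        # Placeholder: kick off the Hightouch sync into Mailchimp.
        ...

    with DAG(
        dag_id="elt_dbt_reverse_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_load = PythonOperator(
            task_id="fivetran_sync", python_callable=trigger_fivetran_sync
        )
        transform = PythonOperator(
            task_id="dbt_cloud_job", python_callable=run_dbt_cloud_job
        )
        reverse_etl = PythonOperator(
            task_id="hightouch_sync", python_callable=trigger_hightouch_sync
        )

        # Each task runs only if the previous one succeeded, so a failed
        # load stops the chain instead of letting downstream jobs run.
        extract_load >> transform >> reverse_etl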

Thanks in advance.

3 Answers


You're looking for a data orchestrator. Airflow is the most popular choice, but Dagster and Prefect are newer challengers with some very nice features that are built specifically for managing data pipelines (vs. Airflow, which was built for task pipelines that don't necessarily pass data).

All three of these tools are open source, but orchestrators can get complex very quickly, and unless you're comfortable deploying Kubernetes and managing complex infrastructure you may want to consider a hosted (paid) solution. (Hosted Airflow is offered under the brand name Astronomer.)

Because of this complexity, you should ask yourself if you really need an orchestrator today, or if you can wait to implement one. There are hacky/brittle ways to coordinate these tools (e.g., cron, GitHub Actions, having downstream tools poll for fresh data, etc.), and at a startup's scale (one-person data team) you may actually be able to move much faster with a hacky solution for some time. Does it really impact your users if there is a 1-hour delay between loading data and transforming it? How much value is added to the business by closing that gap vs. spending your time modeling more data or building more reporting? Realistically for a single person new to the space, you're probably looking at weeks of effort until an orchestrator is adding value; only you will know if there is an ROI on that investment.
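
To illustrate, the simplest version of that hack is a single script, run by cron or a GitHub Action, that triggers the dbt Cloud job, polls until it finishes, and only then kicks off the reverse ETL sync. A rough sketch (the dbt Cloud v2 endpoints and response fields below match their docs at the time of writing, but verify them; the Hightouch step is a placeholder):

    import os
    import time

    import requests

    # Hypothetical environment variables holding your dbt Cloud credentials.
    DBT_ACCOUNT_ID = os.environ["DBT_ACCOUNT_ID"]
    DBT_JOB_ID = os.environ["DBT_JOB_ID"]
    DBT_TOKEN = os.environ["DBT_CLOUD_TOKEN"]

    BASE = f"https://cloud.getdbt.com/api/v2/accounts/{DBT_ACCOUNT_ID}"
    HEADERS = {"Authorization": f"Token {DBT_TOKEN}"}

    def run_dbt_job_and_wait() -> bool:
        """Trigger the dbt Cloud job and poll until it finishes."""
        run = requests.post(
            f"{BASE}/jobs/{DBT_JOB_ID}/run/",
            headers=HEADERS,
            json={"cause": "scheduled pipeline run"},
        ).json()["data"]
        while True:
            status = requests.get(
                f"{BASE}/runs/{run['id']}/", headers=HEADERS
            ).json()["data"]
            if status["is_complete"]:
                return status["is_success"]
            time.sleep(30)

    def trigger_reverse_etl() -> None:
        # Placeholder: call the Hightouch (or Hevo) API to start the sync.
        ...

    if __name__ == "__main__":
        # Only run the downstream sync if the dbt job succeeded, so a
        # failure upstream doesn't push stale data to Mailchimp.
        if run_dbt_job_and_wait():
            trigger_reverse_etl()
        else:
            raise SystemExit("dbt Cloud job failed; skipping reverse ETL sync")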

tconbeer

I use Dagster for orchestrating multiple dbt projects or dbt models alongside other data pipeline processes (e.g. database initialization, PySpark, etc.).

Here is a more detailed description and demo code:
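
As a minimal sketch of that pattern (a toy example, not the linked demo), a Dagster job might chain a setup op, a dbt run, and a downstream step, where any failure stops the chain:

    import subprocess

    from dagster import job, op

    @op
    def initialize_database():
        # Placeholder: create schemas, seed reference tables, etc.
        ...

    @op
    def run_dbt_models(start):
        # Run the dbt project via the CLI; check=True raises on failure,
        # which fails this op and halts the job.
        subprocess.run(["dbt", "run"], check=True)

    @op
    def run_spark_processing(start):
        # Placeholder: e.g. a PySpark step that consumes the dbt output.
        ...

    @job
    def analytics_pipeline():
        # Passing each op's output into the next establishes the ordering.
        run_spark_processing(run_dbt_models(initialize_database()))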


You could try the following workflow, which needs a couple of additional tools but shouldn't require any custom engineering effort on orchestration.

E+L (Fivetran) -> T: use Shipyard to trigger a dbt Cloud job -> reverse E+L: trigger a Hightouch or Census sync on completion of the dbt Cloud job

This should run your entire pipeline in a single flow.
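
For reference, the last hop is just one API call. A rough sketch, assuming Hightouch's sync-trigger endpoint and bearer-token auth (verify the exact path and payload against their current API docs):

    import os

    import requests

    # Hypothetical environment variables for the Hightouch credentials.
    SYNC_ID = os.environ["HIGHTOUCH_SYNC_ID"]
    TOKEN = os.environ["HIGHTOUCH_API_TOKEN"]

    def trigger_hightouch_sync() -> None:
        """Kick off the reverse ETL sync as the final step of the flow."""
        resp = requests.post(
            f"https://api.hightouch.com/api/v1/syncs/{SYNC_ID}/trigger",
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        resp.raise_for_status()

    trigger_hightouch_sync()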