
I am building an open-source data stack for a large-scale batch pipeline. The data will later feed an ML model that is retrained quarterly.

I want to use Airbyte for ingestion and Airflow for general orchestration.

In general, I want to use modern open-source software, but I ran into some issues when choosing tools for data storage and transformation. First I considered Cassandra and PySpark, but I read in several sources that they are not really compatible, or at least take some effort to integrate. Then I thought about going with something like dbt and Postgres. But dbt only seems like a good choice for analytics, not for ML data: data enrichment, for example, is not something SQL in dbt handles well. Postgres might also be a bad choice, because if I end up with huge data volumes its performance will degrade.

Are there any suggestions in regards to which tools I should use?

1 Answer


I believe there are two problems to be solved here, and solving one doesn’t mean you don’t need the other.

You can use dbt to go from raw data to relations that are ready for data science. At that point you can switch to tools and libraries suited to DS and ML.
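To make the hand-off concrete, here is a minimal sketch of that boundary. It uses SQLite in memory as a stand-in for your warehouse, and the table name `stg_orders` is a hypothetical dbt model, not anything from your setup: dbt materializes the clean relation in SQL, then Python takes over for the enrichment that is awkward to express in dbt.

```python
import sqlite3

# Stand-in for the warehouse: in production this would be Postgres (or
# another engine) holding the relation that a dbt model materialized.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stg_orders (   -- hypothetical dbt model name
        customer_id INTEGER,
        amount REAL
    )
""")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (2, 40.0)],
)

# Hand-off point: pull the dbt-built relation into Python and do the
# ML-style feature engineering there instead of in SQL.
rows = conn.execute(
    "SELECT customer_id, SUM(amount) FROM stg_orders GROUP BY customer_id"
).fetchall()
features = {cid: {"total_spend": total} for cid, total in rows}
```

The design point is that dbt owns everything up to the clean relation, and the ML code only ever reads from that relation, so each side can evolve independently.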

Finally, you will need to deploy that ML model and you will need tools for ML Ops.

The reason I keep dbt in the mix is that I presume there are other use cases for the data, and SQL, along with dbt's other features like data-quality tests, docs, and lineage, is better suited to a wider audience.

noel_g