I am building an open source data stack for a large-scale batch pipeline. The data will later feed an ML model that is retrained quarterly.
I want to use Airbyte for ingestion and Airflow for general orchestration.
In general, I want to use modern open source software, but I ran into some issues when choosing tools for data storage and transformation. First I considered Cassandra with PySpark, but several sources say that combining them is not straightforward and requires extra effort (e.g. a dedicated connector). Then I thought about dbt on Postgres. However, dbt seems to be a good fit mainly for analytics workloads rather than ML data preparation; for example, row-level data enrichment is hard to express well in SQL within dbt. Postgres might also be a bad choice because, if data volumes grow large, query performance will degrade.
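To make concrete what I mean by enrichment that is clumsy in dbt's SQL models, here is a minimal Python sketch of row-level, rule-based enrichment (the rule table, record shape, and function names are all made up for illustration):

```python
# Hypothetical example: keyword-based categorization of free-text fields.
# In SQL this turns into long chains of CASE WHEN ... LIKE ... clauses;
# in Python it is a small, testable function.

RULES = [
    ("grocery", ("aldi", "lidl", "market")),
    ("transport", ("uber", "bahn", "fuel")),
]

def categorize(description: str) -> str:
    """Map a free-text description to a category via keyword rules."""
    text = description.lower()
    for category, keywords in RULES:
        if any(kw in text for kw in keywords):
            return category
    return "other"

def enrich(records):
    """Attach a derived 'category' field to each record dict."""
    return [{**r, "category": categorize(r["description"])} for r in records]

rows = [
    {"id": 1, "description": "ALDI Sued Berlin"},
    {"id": 2, "description": "Uber trip 2023-04-01"},
]
print(enrich(rows))
```

This kind of per-row logic (lookups, fuzzy matching, calling out to a model or API) is where pure-SQL transformation tools feel limiting to me.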
Are there any suggestions on which tools I should use?