I had the task of rewriting about a TB of data (split into 200+ partitions by day) as compressed Parquet files with a changed schema. Basically I used Spark SQL to adjust the schema to my new format: some columns were cast, others recomputed or split.
I then wanted to save the result in as few files per day as possible, roughly: `df.repartition($"year", $"month", $"day").write.partitionBy("year", "month", "day").parquet(...)`
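In code, the transformation and write looked roughly like the sketch below. The paths, column names, and expressions are placeholders for illustration, not the real ones:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("reparquet").getOrCreate()
import spark.implicits._

// Read the source table, which is laid out as year=/month=/day= partitions
val df = spark.read.parquet("s3://source-bucket/table/") // placeholder path

// Schema adjustments via Spark SQL expressions: casts, recomputed and split columns
// (column names and expressions are made up for illustration)
val projected = df
  .withColumn("amount", $"amount".cast("decimal(18,2)"))
  .withColumn("city", split($"address", ",")(0))

// Write as few files per day as possible: repartition by the partition columns
// so each day collapses into a single task, then write partitioned by day
projected
  .repartition($"year", $"month", $"day")
  .write
  .partitionBy("year", "month", "day")
  .parquet("s3://target-bucket/table/") // placeholder path
```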
The source data was a table partitioned by the same year/month/day columns, so in theory the job should have worked without any exchange step: simply take one day at a time and write it to its new location with the changed data/schema.
I expected the whole job to be network bound, reading from and writing to S3.
That did not work though. The Spark EMR cluster first tried to read all the data, shuffle it, and only then write it back out. That was a problem because I did not want to set up a massive cluster: the data did not fit in RAM, and eventually the disk filled up with shuffle temp files from the exchange step.
I ended up splitting the work into terribly slow, inefficient serial steps, which took forever.
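Roughly, the serial workaround looked like the sketch below. This is an assumption about how the splitting was done: the day list, the paths, and the `adjustSchema` helper are all placeholders, and `spark` is the session from the sketch above:

```scala
// Hypothetical helper standing in for the actual casts / recomputed / split columns
def adjustSchema(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame =
  df // placeholder

// Placeholder list of (year, month, day) partitions to process one by one
val days: Seq[(Int, Int, Int)] = Seq((2017, 1, 1), (2017, 1, 2))

for ((y, m, d) <- days) {
  // Read exactly one day's directory so nothing needs to be shuffled
  val oneDay = spark.read
    .parquet(s"s3://source-bucket/table/year=$y/month=$m/day=$d/")
    .transform(adjustSchema)

  // Coalesce to keep the number of output files per day small, then write
  // straight into the matching target directory
  oneDay
    .coalesce(1)
    .write
    .parquet(s"s3://target-bucket/table/year=$y/month=$m/day=$d/")
}
```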
TL;DR: How does one handle such a primitive (projection-only) ETL job in Spark? It is basically just a projection: no joins, only coalescing of partitions.
Read data -> project columns -> write data (partitioned). I want to avoid the exchange step in the DAG, or is that just not possible with Spark?
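To make the question concrete, this is how I am looking at the plans (sketch; the path is a placeholder and `projected` stands for the transformed DataFrame from the first sketch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-check").getOrCreate()
import spark.implicits._

// Stand-in for the transformed DataFrame from the sketch above
val projected = spark.read.parquet("s3://source-bucket/table/") // placeholder path

// Without an explicit repartition the plan is just read -> project -> write,
// with no Exchange, but every task can write its own small file into each day.
projected.explain()

// Repartitioning by the partition columns collapses each day into one task,
// but an Exchange (hashpartitioning on year, month, day) appears in the plan.
projected.repartition($"year", $"month", $"day").explain()
```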