I had the task of rewriting about a TB of data (split into 200+ partitions by day) as compressed Parquet files with a changed schema. Basically I used Spark SQL to adjust the schema to my new format: some columns were cast, others recomputed or split.
I then wanted to save the result in as few files per day as possible, roughly: `df.repartition($"year", $"month", $"day").write.partitionBy("year", "month", "day").parquet(...)`
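In code, the transformation and write looked roughly like the sketch below. The paths, column names, and expressions are placeholders for illustration, not the real ones:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("reparquet").getOrCreate()
import spark.implicits._

// Read the source table, which is laid out as year=/month=/day= partitions
val df = spark.read.parquet("s3://source-bucket/table/") // placeholder path

// Schema adjustments via Spark SQL expressions: casts, recomputed and split columns
// (column names and expressions are made up for illustration)
val projected = df
  .withColumn("amount", $"amount".cast("decimal(18,2)"))
  .withColumn("city", split($"address", ",")(0))

// Write as few files per day as possible: repartition by the partition columns
// so each day collapses into a single task, then write partitioned by day
projected
  .repartition($"year", $"month", $"day")
  .write
  .partitionBy("year", "month", "day")
  .parquet("s3://target-bucket/table/") // placeholder path
```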
The source data was a table partitioned by the same year/month/day columns, so in theory the job should have worked without any exchange step: simply take one day at a time and write it to its new location with the changed data/schema.
I expected the whole job to be network bound, reading from and writing to S3.
That did not work though. The Spark EMR cluster first tried to read all the data, shuffle it, and only then write it back out. That was a problem because I did not want to set up a massive cluster: the data did not fit in RAM, and eventually the disk filled up with shuffle temp files from the exchange step.
I ended up splitting the work into terribly slow, inefficient serial steps, which took forever.
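Roughly, the serial workaround looked like the sketch below. This is an assumption about how the splitting was done: the day list, the paths, and the `adjustSchema` helper are all placeholders, and `spark` is the session from the sketch above:

```scala
// Hypothetical helper standing in for the actual casts / recomputed / split columns
def adjustSchema(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame =
  df // placeholder

// Placeholder list of (year, month, day) partitions to process one by one
val days: Seq[(Int, Int, Int)] = Seq((2017, 1, 1), (2017, 1, 2))

for ((y, m, d) <- days) {
  // Read exactly one day's directory so nothing needs to be shuffled
  val oneDay = spark.read
    .parquet(s"s3://source-bucket/table/year=$y/month=$m/day=$d/")
    .transform(adjustSchema)

  // Coalesce to keep the number of output files per day small, then write
  // straight into the matching target directory
  oneDay
    .coalesce(1)
    .write
    .parquet(s"s3://target-bucket/table/year=$y/month=$m/day=$d/")
}
```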
TL;DR: How does one handle such a primitive (projection-only) ETL job in Spark? It is basically just a projection: no joins, only coalescing of partitions.
Read data -> project columns -> write data (partitioned). I want to avoid the exchange step in the DAG, or is that just not possible with Spark?
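To make the question concrete, this is how I am looking at the plans (sketch; the path is a placeholder and `projected` stands for the transformed DataFrame from the first sketch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-check").getOrCreate()
import spark.implicits._

// Stand-in for the transformed DataFrame from the sketch above
val projected = spark.read.parquet("s3://source-bucket/table/") // placeholder path

// Without an explicit repartition the plan is just read -> project -> write,
// with no Exchange, but every task can write its own small file into each day.
projected.explain()

// Repartitioning by the partition columns collapses each day into one task,
// but an Exchange (hashpartitioning on year, month, day) appears in the plan.
projected.repartition($"year", $"month", $"day").explain()
```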