I am running a Spark job which processes about 2 TB of data. The processing involves:
- Read data (Avro files)
- Explode on a column which is a map type
- OrderBy the key from the exploded column
- Filter the DataFrame (I have a very small (7) set of keys (call it keyset) that I want to filter the df for). I do a
df.filter(col("key").isin(keyset: _*))
- I write this df to Parquet (this DataFrame is very small)
- Then I filter the original DataFrame again for all the keys which are not in the keyset
df.filter(!col("key").isin(keyset: _*) )
and write this to Parquet. This is the larger dataset.
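
For context, the whole job looks roughly like this (a minimal sketch; the paths, the map column name `mapCol`, and the keys are placeholders, not my actual code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("avro-split").getOrCreate()

// Placeholder paths and keys
val inputPath = "/data/input-avro"
val smallOut  = "/data/out/keyset"
val largeOut  = "/data/out/rest"
val keyset: Seq[String] = Seq("k1", "k2", "k3", "k4", "k5", "k6", "k7")

// 1. Read the ~2 TB of Avro files
val raw = spark.read.format("avro").load(inputPath)

// 2. Explode the map-type column into key/value rows ("mapCol" is a placeholder name)
val exploded = raw.select(col("*"), explode(col("mapCol")).as(Seq("key", "value")))

// 3. OrderBy the exploded key, then cache before the two filter/write passes
val ordered = exploded.orderBy("key").cache()

// 4. Small subset: rows whose key is in the keyset
ordered.filter(col("key").isin(keyset: _*)).write.parquet(smallOut)

// 5. Large subset: everything else
ordered.filter(!col("key").isin(keyset: _*)).write.parquet(largeOut)

ordered.unpersist()
```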
The original Avro data is about 2 TB and the processing takes about 1 hour; I would like to optimize it. I am caching the DataFrame after step 3 (the orderBy) and using spark.sql.shuffle.partitions = 6000, min executors = 1000, max executors = 2000, executor memory = 20 GB, executor cores = 2. Any other suggestions for optimization? Would a left join perform better than the filter?
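
For the last question, the join-based alternative I have in mind would look roughly like this (reusing the names from the sketch above; since keyset has only 7 entries I would broadcast it and use semi/anti joins):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.broadcast

// Tiny DataFrame holding the 7 keys
val keysDF = keyset.toDF("key")

// Semi join keeps rows whose key is in keyset; anti join keeps everything else
ordered.join(broadcast(keysDF), Seq("key"), "left_semi").write.parquet(smallOut)
ordered.join(broadcast(keysDF), Seq("key"), "left_anti").write.parquet(largeOut)
```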