I have a large dataset whose shuffle read is landing almost entirely on a single executor: https://i.stack.imgur.com/tpi7K.png
I have materialized every operation up to the last one: writing to disk.
So far I have tried `df.repartition(n)` with various values of `n`, but I am not able to break up the hundreds of millions of rows sitting on that single executor.
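Roughly what the final step looks like (a simplified sketch; the paths and the value of `n` are placeholders, not my actual ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/materialized_input")  # placeholder for the materialized data

# Force a full round-robin shuffle before the final write;
# I varied n over several values with no change in the skew.
df = df.repartition(1000)
df.write.mode("overwrite").parquet("/tmp/output")  # placeholder path
```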
I think `sampleBy` is the culprit -- I want to sample values of my target class evenly, but the class distribution is heavily imbalanced. How can I break up the partitions after sampling?
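For reference, the sampling step looks roughly like this (the column name, fractions, and seed below are illustrative, not my real values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/input")  # placeholder input

# Stratified sample of the target column: because the distribution is
# heavily imbalanced, the minority class is kept (almost) in full while
# the majority class is downsampled hard, so the surviving rows end up
# concentrated in a few partitions.
fractions = {0: 0.02, 1: 1.0}  # placeholder per-class sampling fractions
sampled = df.sampleBy("label", fractions, seed=42)

# This repartition is where I expected the rows to spread back out
# across executors, but the skew persists.
sampled = sampled.repartition(1000)
```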