
I have a large data set that is mostly being shuffle-read on a single executor: https://i.stack.imgur.com/tpi7K.png

I have materialized all previous operations up until the last one: writing to disk.

What I have tried so far is df.repartition(n) with various values of n, but I am not able to break up the hundreds of millions of rows sitting on this single executor.

I think sampleBy is the culprit: I want to sample my target class evenly, but its distribution is heavily imbalanced. How can I break up the partitions after sampling? A rough sketch of the pipeline is below.
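For reference, this is roughly what the job looks like; the column name `label`, the fractions, and the paths are placeholders, not my actual values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/input")  # placeholder input

# Stratified sample on the target class; the distribution is heavily
# imbalanced, so the per-class fractions differ a lot.
fractions = {0: 0.01, 1: 1.0}  # placeholder rates
sampled = df.sampleBy("label", fractions, seed=42)

# Repartitioning afterwards does not break up the skewed partition:
sampled.repartition(200).write.mode("overwrite").parquet("/path/to/output")
```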

John Stud

1 Answer


You probably have a data skew problem. You can try adding a salt to your keys before repartitioning.

Example: spark: How does salting work in dealing with skewed data
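A minimal PySpark sketch of the idea, assuming your sampled DataFrame is `df` and that a hypothetical `num_salts` is tuned to roughly match your shuffle parallelism:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

num_salts = 200  # assumption: tune to your cluster's parallelism

df = spark.read.parquet("/path/to/sampled_data")  # placeholder input

# Add a random salt column so rows that share a hot key value get
# spread across many partitions instead of landing on one executor.
salted = df.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Repartition on the salt to force an even shuffle, then drop the
# helper column before writing to disk.
salted.repartition(num_salts, "salt") \
      .drop("salt") \
      .write.mode("overwrite").parquet("/path/to/output")
```

Because the salt is uniformly random, rows from the over-represented class no longer hash to the same partition, so the final shuffle spreads them evenly across executors.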

maxime G