
I have a large data set that is mostly being shuffle-read on a single executor: https://i.stack.imgur.com/tpi7K.png

I have materialized all previous operations up until the last one: writing to disk.

What I have tried so far is df.repartition(n) with various values of n, but I am not able to break up the hundreds of millions of rows sitting on this single executor.

I think sampleBy is the culprit: I want to sample my target class evenly, but its distribution is heavily imbalanced. How can I break up the partitions after sampling? A rough sketch of the pipeline is below.
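For reference, this is roughly what the job looks like; the column name `label`, the fractions, and the paths are placeholders, not my actual values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/input")  # placeholder input

# Stratified sample on the target class; the distribution is heavily
# imbalanced, so the per-class fractions differ a lot.
fractions = {0: 0.01, 1: 1.0}  # placeholder rates
sampled = df.sampleBy("label", fractions, seed=42)

# Repartitioning afterwards does not break up the skewed partition:
sampled.repartition(200).write.mode("overwrite").parquet("/path/to/output")
```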

John Stud

1 Answer


You probably have a data skew problem. You can try adding a salt to your keys before repartitioning.

Example: spark: How does salting work in dealing with skewed data
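A minimal PySpark sketch of the idea, assuming your sampled DataFrame is `df` and that a hypothetical `num_salts` is tuned to roughly match your shuffle parallelism:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

num_salts = 200  # assumption: tune to your cluster's parallelism

df = spark.read.parquet("/path/to/sampled_data")  # placeholder input

# Add a random salt column so rows that share a hot key value get
# spread across many partitions instead of landing on one executor.
salted = df.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Repartition on the salt to force an even shuffle, then drop the
# helper column before writing to disk.
salted.repartition(num_salts, "salt") \
      .drop("salt") \
      .write.mode("overwrite").parquet("/path/to/output")
```

Because the salt is uniformly random, rows from the over-represented class no longer hash to the same partition, so the final shuffle spreads them evenly across executors.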

maxime G