
I have two DataFrames, each reading 1 TB of data. The code below runs very slowly. Is there a way to improve its performance?

diffDF = df1.subtract(df2)
Oli
TommyQu

1 Answer


In general, if you have two large datasets that you must shuffle, there isn't much you can do to improve performance (apart from configuration tuning).

However, depending on the data and specific use case, you can try the following mitigations:

  1. Assuming you have some id column(s) that uniquely identify each record in your datasets, you can replace except/subtract with a left anti-join, which may be faster (see Any difference between left anti join and except in Spark?).
  2. In some cases, if you can eliminate irrelevant records from df2 before the join and are left with a relatively small set of ids, you may be able to perform a broadcast join, which avoids the shuffle entirely and will significantly improve performance.
Grisha Weintraub