I have two DataFrames, each reading 1 TB of data. The code below runs very slowly. Is there a way to improve its performance?
diffDF = df1.subtract(df2)
In general, if you have two large datasets that both must be shuffled, you can't do much to improve the performance (other than configuration tuning).
However, depending on the data and the specific use case, you can try the following mitigations:
- Instead of `except`/`subtract` you can use a left anti-join, which might be faster (see Any difference between left anti join and except in Spark?); a sketch follows this list.
- If you can filter `df2` before the join and keep a relatively small number of ids to join, you may be able to perform a broadcast join, and that will significantly improve the performance; see the second sketch below.
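A minimal sketch of the left anti-join alternative, assuming both DataFrames share the same schema (the Parquet paths are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anti-join-example").getOrCreate()

# Stand-ins for the two large DataFrames from the question
# (the paths are hypothetical).
df1 = spark.read.parquet("/data/df1")
df2 = spark.read.parquet("/data/df2")

# A left anti-join on all columns approximates subtract():
# it keeps the rows of df1 that have no matching row in df2.
# Note: unlike subtract(), it does not deduplicate the result
# and does not treat NULL join keys as equal.
diffDF = df1.join(df2, on=df1.columns, how="left_anti")
```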
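And a sketch of the broadcast variant, assuming `df2` can be filtered down to something small enough to fit in executor memory; the filter predicate and the `event_date`/`id` column names are purely illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# Hypothetical filter that shrinks df2 to a small set of ids.
small_df2 = df2.filter(F.col("event_date") >= "2024-01-01").select("id").distinct()

# broadcast() ships the small DataFrame to every executor,
# so the 1 TB df1 is never shuffled across the network.
diffDF = df1.join(broadcast(small_df2), on="id", how="left_anti")
```

The broadcast hint only helps when the filtered side actually fits in memory (Spark's default auto-broadcast threshold is 10 MB, configurable via `spark.sql.autoBroadcastJoinThreshold`); broadcasting a side that is too large will fail or thrash the executors.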