I have two DataFrames, each reading 1 TB of data. The code below runs very slowly. Is there a way to improve its performance?
diffDF = df1.subtract(df2)
In general, if you have two large datasets that both must be shuffled, you can't do much to improve the performance (other than configuration tuning).
However, depending on the data and the specific use case, you can try the following mitigations:
- Instead of `except`/`subtract` you can use a left anti-join, which might be faster (see Any difference between left anti join and except in Spark?); a sketch follows this list.
- If you can filter `df2` before the join and keep a relatively small number of ids to join, you may be able to perform a broadcast join, and that will significantly improve the performance; see the second sketch below.
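A minimal sketch of the left anti-join alternative, assuming both DataFrames share the same schema (the Parquet paths are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("anti-join-example").getOrCreate()

# Stand-ins for the two large DataFrames from the question
# (the paths are hypothetical).
df1 = spark.read.parquet("/data/df1")
df2 = spark.read.parquet("/data/df2")

# A left anti-join on all columns approximates subtract():
# it keeps the rows of df1 that have no matching row in df2.
# Note: unlike subtract(), it does not deduplicate the result
# and does not treat NULL join keys as equal.
diffDF = df1.join(df2, on=df1.columns, how="left_anti")
```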
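And a sketch of the broadcast variant, assuming `df2` can be filtered down to something small enough to fit in executor memory; the filter predicate and the `event_date`/`id` column names are purely illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# Hypothetical filter that shrinks df2 to a small set of ids.
small_df2 = df2.filter(F.col("event_date") >= "2024-01-01").select("id").distinct()

# broadcast() ships the small DataFrame to every executor,
# so the 1 TB df1 is never shuffled across the network.
diffDF = df1.join(broadcast(small_df2), on="id", how="left_anti")
```

The broadcast hint only helps when the filtered side actually fits in memory (Spark's default auto-broadcast threshold is 10 MB, configurable via `spark.sql.autoBroadcastJoinThreshold`); broadcasting a side that is too large will fail or thrash the executors.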