
I have two CSVs, df_sales and df_products. I want to use PySpark to:

  1. Join df_sales and df_products on product_id: df_merged = df_sales.join(df_products, df_sales.product_id == df_products.product_id, "inner")
  2. Compute the sum of df_sales.num_pieces_sold per product: df_sales.groupby("product_id").agg(sum("num_pieces_sold"))
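Put together, a minimal sketch of both steps (assuming the two CSVs have already been read into df_sales and df_products, and using F.sum from pyspark.sql.functions so it does not shadow the Python builtin; the total_pieces_sold alias is just for illustration):

from pyspark.sql import functions as F

# step 1: inner join on product_id
df_merged = df_sales.join(
    df_products, df_sales.product_id == df_products.product_id, "inner"
)

# step 2: total pieces sold per product
df_totals = df_sales.groupby("product_id").agg(
    F.sum("num_pieces_sold").alias("total_pieces_sold")
)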

Both 1 and 2 require df_sales to be shuffled on product_id.
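To confirm this, you can print the physical plan of each query; both plans should contain an Exchange hashpartitioning step on product_id, which is the shuffle:

# each plan should show Exchange hashpartitioning(product_id, ...)
df_merged.explain()
df_totals.explain()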

How can I avoid shuffling df_sales twice?


1 Answer


One solution to do what you ask would be to use repartition to shuffle the dataframe once, and then cache to keep the result in memory:

from pyspark.sql import functions as F

# shuffle df_sales on product_id once and keep the result in memory
cached_df_sales = df_sales.repartition("product_id").cache()

# both operations now start from the cached, already-partitioned data
df_merged = cached_df_sales.join(
    df_products, cached_df_sales.product_id == df_products.product_id, "inner"
)
df_totals = cached_df_sales.groupby("product_id").agg(F.sum("num_pieces_sold"))
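If memory is a concern, you can release the cache once both results have been materialized by an action (a count, a write, etc.):

cached_df_sales.unpersist()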

However, I am not sure this is a good idea. Depending on its size, caching the entire df_sales dataframe might take a lot of memory. Also, the groupBy will only shuffle two columns of the dataframe, which could turn out to be rather inexpensive. I would start by making sure of that before trying to avoid a shuffle.

More generally, before trying to optimize anything, write it simply, run it, see what takes time and focus on that.
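For example, a rough way to see where the time goes from Python is to time the action that materializes each result (just a sketch; the Spark UI gives a proper per-stage breakdown):

import time

start = time.perf_counter()
df_merged.count()  # action that forces the join to run
print(f"join: {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
df_totals.count()  # action that forces the aggregation to run
print(f"aggregation: {time.perf_counter() - start:.1f} s")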

  • Yeah, you are right. I had actually tried caching the repartitioned data, but it ended up being significantly slower than not caching it. I had read somewhere that the output of a shuffle operation is always persisted on the reduce side and reused if it can be used later on. I wonder why that is not happening in my example. – figs_and_nuts Jan 02 '23 at 08:13