I have a problem with the final left join of a transformation.

The job keeps failing with the following error: There is too much data being sent to the driver. 4.0 GiB of serialized data from 10700 tasks exceeds the limit of 4.0 GiB. I don't know how to fix it.
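
As a rough sanity check on the numbers in that message, here is the average amount of serialized data per task. This is only a sketch: it assumes the data is spread evenly over the 10700 tasks, which the error message does not state, and the helper name `avg_bytes_per_task` is made up for illustration.

```python
GIB = 1024 ** 3  # bytes in one GiB
KIB = 1024       # bytes in one KiB


def avg_bytes_per_task(total_bytes: float, num_tasks: int) -> float:
    """Average serialized result size per task, assuming an even spread."""
    return total_bytes / num_tasks


# Figures taken from the error message: 4.0 GiB from 10700 tasks.
avg = avg_bytes_per_task(4.0 * GIB, 10700)
print(f"{avg / KIB:.0f} KiB per task on average")  # prints: 392 KiB per task on average
```

So each task returns only a few hundred KiB on average; the limit is reached because of the sheer number of tasks, not because any single result is large.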

The desired final dataset contains 4.5 million rows and shouldn't be complicated to obtain. I have already disabled broadcast joins, but to no avail.

Edit with more detail: The join is on the last line of the code (which involves fairly heavy operations), and that is precisely where it fails. Without this join I can build both datasets (df and df2); the error only appears when I execute the join.

df --> ~2,500,000 rows, 3 columns, 24.5 MB in size. (Result of F.explode on DATE for each ID.)

df2 --> ~700,000 rows, 10 columns, 29.5 MB in size. (Result of a union of several datasets.)

df_final = df.join(df2, ['ID', 'DATE'], 'left')
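
For context, a minimal sketch of what this join looks like with automatic broadcasting switched off. `spark.sql.autoBroadcastJoinThreshold` is the standard Spark SQL setting for this, but the post does not say how broadcasting was actually disabled, so treat that part as an assumption; the helper name `left_join_without_broadcast` and the `spark` session argument are hypothetical.

```python
def left_join_without_broadcast(spark, df, df2, keys=("ID", "DATE")):
    """Left-join df with df2 on `keys` with automatic broadcast joins disabled.

    `spark` is expected to be an active SparkSession and `df` / `df2`
    PySpark DataFrames (assumed; nothing is imported here on purpose).
    """
    # A threshold of -1 tells Spark SQL not to pick a broadcast join automatically.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    # Same join as in the question: every row of df is kept, df2 columns are matched in.
    return df.join(df2, on=list(keys), how="left")
```

Calling `left_join_without_broadcast(spark, df, df2).explain()` prints the physical plan, which should show whether a broadcast join is still being chosen.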

Please help me! Thank you!

Jresearcher