
I have recently converted an enormous SAS DATA step program to PySpark, and I think the query is so large that the Catalyst optimizer causes an OOM error in the driver. I can run the query when I increase the driver memory to 256 GB, but with anything less the job fails. This happens even when I run on a dataset with very few records.

This query takes a single input dataset and performs transformations on the input columns to generate a new set of columns. There are no joins, just thousands of intermediate calculations that produce a final dataset with ~800 columns.

How can I structure such a large query so that Spark can run it with fewer compute resources? I am being deliberately vague, but the new columns I am producing essentially use F.when and some array operations on columns created with F.split.
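To give a rough, sanitized idea, each expression looks something like the sketch below; the column names are hypothetical, but the F.split / F.when pattern is the real one:

from pyspark.sql import functions as F

# split a delimited string column into an array (hypothetical column name)
codes = F.split(F.col("raw_codes"), ",")

# derive a new column with conditional logic over the array elements
flag_a = F.when(codes.getItem(0) == "A", F.lit(1)).otherwise(F.lit(0))

# thousands of similar intermediate expressions feed into the final 800 columns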

My final action looks like the snippet below and produces a logical plan that is enormous.


cols = [<list of 800 column expressions>]

df.select(*cols).write.parquet("<path/to/file>")

I have read plenty about checkpointing and how it truncates the logical plan. Does the plan need to be sent to the executors, and might it be too big to fit in their memory? What is the best-practice way to structure an enormous query? My first thought is to break it up into many smaller queries and then join them at the end; a rough sketch of that idea follows.
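For illustration, a staged version might look roughly like this. It assumes the input has a unique key column (called id here) and that the expressions can be split into batches (cols_group1, cols_group2); both are assumptions invented for the sketch:

# assumption: "id" is a unique key, and cols_group1/cols_group2 are
# hypothetical batches of the 800 final column expressions
spark.sparkContext.setCheckpointDir("<path/to/checkpoint/dir>")

# checkpointing materializes each batch and truncates its lineage
part1 = df.select("id", *cols_group1).checkpoint()
part2 = df.select("id", *cols_group2).checkpoint()

# join the smaller results back together on the key at the end
result = part1.join(part2, on="id")
result.write.parquet("<path/to/file>")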

Zelazny7
  • What is the source of the query, and where does it read the data from? I don't see anything besides the final two lines of code you have added to the question. It would be helpful if you could add the code that reads the data and does the transformations. – Nikunj Kakadiya Dec 18 '22 at 05:48
  • Maybe the solution is to avoid checkpointing, persisting, and actions like show or collect. PySpark can adapt to memory limits if you don't force it to load all of the data. – Amir Hossein Shahdaei Dec 18 '22 at 10:06
  • The source of the query is a Python module I wrote that contains 2000 column expressions. Of these, only 800 are used in the final dataframe, but those 800 use the others in intermediate calculations. The logical plan tree that is generated is enormous. – Zelazny7 Dec 18 '22 at 14:05

0 Answers