My program works like this:
- Read in many files as DataFrames. Among them is a group of about 60 files with ~5k rows each; I create a separate DataFrame for each, do some simple processing, and then union them all into one DataFrame that is used for further joins (see the sketch after this list).
- I perform a number of joins and column calculations on several DataFrames, which finally results in a target DataFrame.
- I save the target DataFrame as a Parquet file.
- In the same Spark application, I load that Parquet file back and do some heavy aggregation followed by multiple self-joins on that DataFrame.
- I save the resulting DataFrame as another Parquet file.
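For reference, here is a minimal sketch of the pipeline, assuming Spark's Scala API; the file paths, column names (`id`, `a`, `b`) and the per-file processing step are placeholders rather than my actual code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("pipeline").getOrCreate()

// Step 1: read the ~60 small files, do some simple per-file processing, union them.
// The paths and the withColumn step are placeholders.
val groupPaths: Seq[String] = (1 to 60).map(i => s"group/file_$i.csv")
val unioned: DataFrame = groupPaths
  .map { path =>
    spark.read.option("header", "true").csv(path)
      .withColumn("source", lit(path)) // stand-in for the real processing
  }
  .reduce(_ union _)                   // one DataFrame used for further joins

// Step 2: joins and column calculations against other DataFrames (placeholders).
val other  = spark.read.parquet("other.parquet")
val target = unioned.join(other, Seq("id")).withColumn("total", col("a") + col("b"))

// Step 3: save the target DataFrame as the first Parquet file.
target.write.mode("overwrite").parquet("target.parquet")

// Steps 4-5: reload the Parquet file, aggregate heavily, self-join, save again.
val reloaded   = spark.read.parquet("target.parquet")
val aggregated = reloaded.groupBy("id").agg(sum("total").as("sum_total"))
val second = aggregated.as("l")
  .join(aggregated.as("r"), col("l.sum_total") === col("r.sum_total"))
  .select(col("l.id").as("left_id"), col("r.id").as("right_id"))
second.write.mode("overwrite").parquet("second.parquet")
```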
The Problem
If I have just one file instead of 60 in the group mentioned above, everything works with the driver having 8g of memory. With 60 files, the first three steps work fine, but the driver runs out of memory while preparing the second Parquet file. Things only improve when I increase the driver's memory to 20g.
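For context, I set driver memory on spark-submit, since `spark.driver.memory` has to be known before the driver JVM starts (setting it inside the application is too late in client mode). The class and jar names below are placeholders:

```bash
# 8g was the original setting; the job only passes once this is raised to 20g.
spark-submit \
  --class com.example.MyApp \
  --driver-memory 20g \
  my-app.jar
```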
The Question
Why is that? When calculating the second file, I do not use the DataFrames that were used to calculate the first file, so their number and content should not matter as long as the size of the first Parquet file stays the same, should they? Do those 60 DataFrames get cached somehow and occupy the driver's memory? I don't do any caching myself, and I never call collect on anything. I don't understand why 8g of memory would not be sufficient for the Spark driver.
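For what it's worth, a quick sanity check I can run inside the application (a minimal sketch, assuming Spark's Scala API) to confirm that nothing is explicitly persisted:

```scala
// Since I never call cache()/persist(), the map of explicitly persisted
// RDDs/DataFrames tracked by the SparkContext should be empty at this point.
val persisted = spark.sparkContext.getPersistentRDDs
println(s"explicitly persisted RDDs: ${persisted.size}")
persisted.values.foreach(rdd => println(s"  id=${rdd.id} storage=${rdd.getStorageLevel}"))
```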