
I have a pandas dataframe consisting of 180M rows and 4 columns (all integers). I saved it as a pickle file, which is 5.8GB. I'm trying to convert the pandas dataframe to a PySpark dataframe using `spark_X = spark.createDataFrame(X)`, but I keep getting an "out of memory" error.
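Roughly, the full flow looks like this (the file name and app name below are just placeholders for my actual setup):

```python
import pandas as pd
from pyspark.sql import SparkSession

# Plain default session, i.e. no explicit driver-memory settings
spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# X has ~180M rows and 4 integer columns (~5.8GB as a pickle)
X = pd.read_pickle("X.pkl")

# This is the call that fails with java.lang.OutOfMemoryError: Java heap space
spark_X = spark.createDataFrame(X)
```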

The error snippet is:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space

I have over 200GB of memory, and I don't think a lack of physical memory is the issue. I read that there are multiple memory limits in Spark, e.g. the driver memory. Could this be the cause?

How can I resolve or work around this?

Rayne
    Did you try any of the suggestions here? https://stackoverflow.com/questions/32336915/pyspark-java-lang-outofmemoryerror-java-heap-space – bzu Aug 12 '22 at 10:03
  • Thanks, I'll give them a try. – Rayne Aug 12 '22 at 12:11
  • @Rayne When you say you have 200GB of memory, is that the total resources in your cluster? Also, which mode and what config are you using? – Jonathan Lam Aug 16 '22 at 06:12
  • @Jonathan Yes, this is the physical memory I have. Anyway, I haven't encountered the problem since changing the `spark.driver.memory` setting to `32g`. – Rayne Aug 16 '22 at 10:31

1 Answer


As suggested by @bzu, the answer here solved my problem.

I did have to manually create the `$SPARK_HOME/conf` folder and the `spark-defaults.conf` file, though, as they did not exist. Also, I changed the setting to:

`spark.driver.memory 32g`
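For anyone who prefers not to edit config files: as far as I know, the same value can also be passed when building the session, as long as no driver JVM is running yet. This is a sketch, not exactly what I ran (the app name is a placeholder):

```python
from pyspark.sql import SparkSession

# Equivalent to putting "spark.driver.memory 32g" in $SPARK_HOME/conf/spark-defaults.conf.
# Driver memory must be set before the driver JVM starts, so this only takes effect
# if no SparkSession/SparkContext already exists in the process.
spark = (
    SparkSession.builder
    .appName("pandas-to-spark")              # placeholder app name
    .config("spark.driver.memory", "32g")
    .getOrCreate()
)
```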
Rayne