I have a pandas DataFrame with 180M rows and 4 columns (all integers). Saved as a pickle file, it is 5.8 GB on disk. I'm trying to convert it to a PySpark DataFrame with spark_X = spark.createDataFrame(X), but I keep getting an out-of-memory error.
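For context, here is a minimal sketch of what I'm running (the pickle path and session setup are simplified placeholders):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Load the 180M-row pandas DataFrame (4 integer columns); "X.pkl" is a placeholder path
X = pd.read_pickle("X.pkl")

# This is the call that fails with java.lang.OutOfMemoryError: Java heap space
spark_X = spark.createDataFrame(X)
```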
The error snippet is:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space
The machine has over 200 GB of RAM, so I don't think a lack of physical memory is the issue. I've read that Spark has several separate memory limits, e.g. the driver memory - could that be the cause?
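If the driver memory is the limit, I assume it would be raised with something like the following when building the session (the 64g figure is just an illustrative guess, and as I understand it spark.driver.memory only takes effect for a freshly started session, not one that is already running):

```python
from pyspark.sql import SparkSession

# Illustrative value only; spark.driver.memory must be set before the
# driver JVM starts, so it has to be configured on a new session.
spark = (
    SparkSession.builder
    .appName("pandas-to-spark")
    .config("spark.driver.memory", "64g")
    .getOrCreate()
)
```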
How can I resolve or work around this?