Is it possible to convert from a Spark DataFrame to a pd.DataFrame under the %pyspark environment?

– Hello lad

1 Answer


Try:

spark_df.toPandas()

toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

This is only available if Pandas is installed and available.

And if you want the opposite:

spark_df = spark.createDataFrame(pandas_df)
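
A minimal end-to-end sketch of both directions, assuming PySpark 2.x or later, where the entry point is a SparkSession conventionally named `spark` (a Zeppelin %pyspark paragraph already provides one):

from pyspark.sql import SparkSession
import pandas as pd

# In %pyspark the session already exists; getOrCreate() simply reuses it
spark = SparkSession.builder.appName("pandas-roundtrip").getOrCreate()

# pandas -> Spark
pandas_df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
spark_df = spark.createDataFrame(pandas_df)

# Spark -> pandas (collects every row onto the driver)
back_to_pandas = spark_df.toPandas()
print(back_to_pandas)

Keep in mind that toPandas() pulls the entire distributed DataFrame into the driver's memory, which is exactly what the comments below run into.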
– Thiago Baldim
  • 2
    This is not working if the pandas dataframe is very very big. – Mulgard Nov 20 '17 at 13:05
  • What is the error? – Thiago Baldim Nov 20 '17 at 23:10
  • 2
    java heap out of memory error. – Mulgard Nov 21 '17 at 21:08
  • 1
    The heap for the driver maybe is too low for the size of the DataFrame and is not allowed to store in the JVM memory Try to change the driver memory size. – Thiago Baldim Nov 22 '17 at 04:25
  • 3
    Also remember that Spark Dataframe uses RDD which is basically a distributed dataset spread across all the nodes. So a big data can be processed without issues. However when you convert this big data set into a Pandas dataframe, it will most likely run out of memory as Pandas dataframe is not distributed like the spark one and uses only the driver node's memory and not all the other available nodes. – user1795667 Apr 13 '19 at 22:48
  • I am getting this error going from a valid 364-row by 9-column Pandas dataframe to a pyspark one: `TypeError: field Total Rain Flag: Can not merge type and ` – Pablo Adames Apr 11 '20 at 15:08
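
Following up on the out-of-memory and type-merge discussion in these comments, here is a hedged sketch of common mitigations. The memory size, the limit() row count, and the sample data are illustrative assumptions, not values taken from the answer:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
import pandas as pd

# 1) Give the driver more heap. spark.driver.memory must be set before the
#    driver JVM starts, so it has no effect on an already-running session.
spark = (
    SparkSession.builder
    .appName("big-toPandas")
    .config("spark.driver.memory", "8g")  # illustrative size
    .getOrCreate()
)

# 2) For the "Can not merge type" TypeError, pass an explicit schema so
#    Spark does not have to infer (and reconcile) mixed column types.
pdf = pd.DataFrame({"Total Rain Flag": ["0", None, "1"]})  # mixed/missing values
schema = StructType([StructField("Total Rain Flag", StringType(), True)])
spark_df = spark.createDataFrame(pdf, schema=schema)

# 3) Shrink the dataset on the cluster before collecting it to the driver,
#    instead of calling toPandas() on the full DataFrame.
small_pdf = spark_df.limit(100000).toPandas()

Newer releases can also make the collection faster and more predictable with Apache Arrow (spark.sql.execution.arrow.pyspark.enabled in Spark 3.x; spark.sql.execution.arrow.enabled in 2.3+), which is worth trying for large frames.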