Is it possible to convert from a Spark DataFrame to a pd.DataFrame under the %pyspark environment?

– Hello lad

1 Answer


Try:

spark_df.toPandas()

toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

This is only available if Pandas is installed and available.

And if you want the opposite:

spark_df = spark.createDataFrame(pandas_df)
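
A minimal end-to-end sketch of both directions, assuming PySpark 2.x or later, where the entry point is a SparkSession conventionally named `spark` (a Zeppelin %pyspark paragraph already provides one):

from pyspark.sql import SparkSession
import pandas as pd

# In %pyspark the session already exists; getOrCreate() simply reuses it
spark = SparkSession.builder.appName("pandas-roundtrip").getOrCreate()

# pandas -> Spark
pandas_df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})
spark_df = spark.createDataFrame(pandas_df)

# Spark -> pandas (collects every row onto the driver)
back_to_pandas = spark_df.toPandas()
print(back_to_pandas)

Keep in mind that toPandas() pulls the entire distributed DataFrame into the driver's memory, which is exactly what the comments below run into.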
– Thiago Baldim
  • 2
    This is not working if the pandas dataframe is very very big. – Mulgard Nov 20 '17 at 13:05
  • What is the error? – Thiago Baldim Nov 20 '17 at 23:10
  • 2
    java heap out of memory error. – Mulgard Nov 21 '17 at 21:08
  • 1
    The heap for the driver maybe is too low for the size of the DataFrame and is not allowed to store in the JVM memory Try to change the driver memory size. – Thiago Baldim Nov 22 '17 at 04:25
  • 3
    Also remember that Spark Dataframe uses RDD which is basically a distributed dataset spread across all the nodes. So a big data can be processed without issues. However when you convert this big data set into a Pandas dataframe, it will most likely run out of memory as Pandas dataframe is not distributed like the spark one and uses only the driver node's memory and not all the other available nodes. – user1795667 Apr 13 '19 at 22:48
  • I am getting this error going from a valid 364-row by 9-column Pandas dataframe to a pyspark one: `TypeError: field Total Rain Flag: Can not merge type and ` – Pablo Adames Apr 11 '20 at 15:08
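
Following up on the out-of-memory and type-merge discussion in these comments, here is a hedged sketch of common mitigations. The memory size, the limit() row count, and the sample data are illustrative assumptions, not values taken from the answer:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
import pandas as pd

# 1) Give the driver more heap. spark.driver.memory must be set before the
#    driver JVM starts, so it has no effect on an already-running session.
spark = (
    SparkSession.builder
    .appName("big-toPandas")
    .config("spark.driver.memory", "8g")  # illustrative size
    .getOrCreate()
)

# 2) For the "Can not merge type" TypeError, pass an explicit schema so
#    Spark does not have to infer (and reconcile) mixed column types.
pdf = pd.DataFrame({"Total Rain Flag": ["0", None, "1"]})  # mixed/missing values
schema = StructType([StructField("Total Rain Flag", StringType(), True)])
spark_df = spark.createDataFrame(pdf, schema=schema)

# 3) Shrink the dataset on the cluster before collecting it to the driver,
#    instead of calling toPandas() on the full DataFrame.
small_pdf = spark_df.limit(100000).toPandas()

Newer releases can also make the collection faster and more predictable with Apache Arrow (spark.sql.execution.arrow.pyspark.enabled in Spark 3.x; spark.sql.execution.arrow.enabled in 2.3+), which is worth trying for large frames.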