
Is there a way to convert a Spark DataFrame (not an RDD) to a pandas DataFrame?

I tried the following:

var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions?

– data_person

3 Answers


The following should work.

Sample DataFrame

    some_df = sc.parallelize([
        ("A", "no"),
        ("B", "yes"),
        ("B", "yes"),
        ("B", "no")
    ]).toDF(["user_id", "phone_number"])

Converting the DataFrame to a pandas DataFrame

    pandas_df = some_df.toPandas()
– Gaurang Shah
  • The `toDF(...)` of the answer is a red herring and should be removed for clarity, IMO. It's already present in the question. That is why I've updated the below answer instead. – ijoseph Dec 27 '19 at 20:43
  • what "sc" stands for in this case? – Gabriel Apr 26 '21 at 12:40
  • @Gabriel it's the Spark context. – Gaurang Shah Apr 26 '21 at 14:22
  • Thank you for the answer. I tried applying this to my code on PySpark 3.2.0 and got an error that a second parameter, `c`, is now required for the function `parallelize` based on . I tried adding a constant `c` with `example_df = sc.parallelize([("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no")], c=4).toDF(["user_id", "phone_number"])`, only to get another error: `AttributeError: 'list' object has no attribute 'defaultParallelism'`. – Curious Watcher Dec 27 '21 at 10:10 (see the sketch below)
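
As the comments note, `sc` is the SparkContext. If `sc.parallelize` gives you trouble, here is a minimal sketch of an alternative, assuming a SparkSession named `spark` is available (as it is by default in the pyspark shell, Zeppelin, and Databricks):

    # Build the sample DataFrame from the SparkSession instead of the SparkContext;
    # the column names are passed as the schema
    some_df = spark.createDataFrame(
        [("A", "no"),
         ("B", "yes"),
         ("B", "yes"),
         ("B", "no")],
        ["user_id", "phone_number"])

    pandas_df = some_df.toPandas()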

In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:

pandas_df = spark_df.select("*").toPandas()
– Inna
  • There is no need to put `select("*")` on the DataFrame unless you want specific columns. It is not going to affect performance, as execution is lazy and the `select` does nothing here. – Gaurang Shah Aug 13 '19 at 13:33
  • For some reason, the solution from @Inna was the only one that worked on my dataframe. No conversion was possible except by selecting all columns beforehand. The data type was the same as usual, but I had previously applied a UDF. – DataBach Apr 02 '20 at 11:41
  • I am using this, but most of my Spark decimal columns are converted to object in pandas instead of float. I have 100+ columns. Is there a way this type casting can be modified? – Resham Wadhwa Apr 09 '21 at 12:01
  • You can write a function and type cast it. – Scope Oct 18 '21 at 18:29 (see the sketch below)
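
Following up on the last two comments: one way to do that cast is to convert every `DecimalType` column to double before calling `toPandas()`. A rough sketch, assuming the DataFrame is named `spark_df` (note that high-precision decimals may lose precision as doubles):

    from pyspark.sql.functions import col
    from pyspark.sql.types import DecimalType

    # Cast decimal columns to double so pandas receives float64 instead of object
    decimal_cols = {f.name for f in spark_df.schema.fields
                    if isinstance(f.dataType, DecimalType)}
    spark_df = spark_df.select(
        [col(c).cast("double") if c in decimal_cols else col(c)
         for c in spark_df.columns])

    pandas_df = spark_df.toPandas()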

Converting a Spark DataFrame to pandas can take time if the DataFrame is large, so you can use something like the following:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pd_df = df_spark.toPandas()

I have tried this in Databricks.

– Shikha (answer edited by Jaimil Patel)
  • The `spark.sql.execution.arrow.enabled` option is highly recommended, especially with `pyspark.pandas` in the upcoming Spark 3.2 release. – RndmSymbl Oct 14 '21 at 12:13
  • The SQL config `spark.sql.execution.arrow.enabled` has been deprecated in Spark v3.0 and may be removed in the future. Use `spark.sql.execution.arrow.pyspark.enabled` instead. – Gangadhar Kadam Mar 06 '22 at 04:01 (see the sketch below)
  • Can you please explain why it is more efficient? – notilas Oct 28 '22 at 03:36
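
Following the deprecation note above, here is the same idea under the newer flag name; a minimal sketch, assuming Spark 3.0+ with PyArrow installed and an active `spark` session (`df_spark` stands in for your DataFrame). Arrow helps because the data leaves the JVM as columnar batches instead of being serialized row by row, which is what makes the conversion faster.

    # Spark 3.0+ name for the Arrow-based transfer flag
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    pd_df = df_spark.toPandas()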