
Is there a way to convert a Spark DataFrame (not an RDD) to a pandas DataFrame?

I tried the following:

var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions?

– data_person

3 Answers


The following should work.

Sample DataFrame

    some_df = sc.parallelize([
        ("A", "no"),
        ("B", "yes"),
        ("B", "yes"),
        ("B", "no")
    ]).toDF(["user_id", "phone_number"])

Converting the DataFrame to a pandas DataFrame

    pandas_df = some_df.toPandas()
– Gaurang Shah
  • The `toDF(...)` of the answer is a red herring and should be removed for clarity, IMO. It's already present in the question. That is why I've updated the below answer instead. – ijoseph Dec 27 '19 at 20:43
  • what "sc" stands for in this case? – Gabriel Apr 26 '21 at 12:40
  • @Gabriel it's the Spark context. – Gaurang Shah Apr 26 '21 at 14:22
  • Thank you for the answer. I tried applying this to my code on PySpark 3.2.0 and got an error that a second parameter, `c`, is now required for the function `parallelize` based on . I tried adding a constant `c` with `example_df = sc.parallelize([("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no")], c=4).toDF(["user_id", "phone_number"])`, only to get another error: `AttributeError: 'list' object has no attribute 'defaultParallelism'`. – Curious Watcher Dec 27 '21 at 10:10 (see the sketch below)
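
As the comments note, `sc` is the SparkContext. If `sc.parallelize` gives you trouble, here is a minimal sketch of an alternative, assuming a SparkSession named `spark` is available (as it is by default in the pyspark shell, Zeppelin, and Databricks):

    # Build the sample DataFrame from the SparkSession instead of the SparkContext;
    # the column names are passed as the schema
    some_df = spark.createDataFrame(
        [("A", "no"),
         ("B", "yes"),
         ("B", "yes"),
         ("B", "no")],
        ["user_id", "phone_number"])

    pandas_df = some_df.toPandas()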

In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:

pandas_df = spark_df.select("*").toPandas()
– Inna
  • There is no need to put `select("*")` on the DataFrame unless you want specific columns. It is not going to affect performance, as execution is lazy and the `select` does nothing here. – Gaurang Shah Aug 13 '19 at 13:33
  • For some reason, the solution from @Inna was the only one that worked on my dataframe. No conversion was possible except by selecting all columns beforehand. The data type was the same as usual, but I had previously applied a UDF. – DataBach Apr 02 '20 at 11:41
  • I am using this, but most of my Spark decimal columns are converted to object in pandas instead of float. I have 100+ columns. Is there a way this type casting can be modified? – Resham Wadhwa Apr 09 '21 at 12:01
  • You can write a function and type cast it. – Scope Oct 18 '21 at 18:29 (see the sketch below)
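
Following up on the last two comments: one way to do that cast is to convert every `DecimalType` column to double before calling `toPandas()`. A rough sketch, assuming the DataFrame is named `spark_df` (note that high-precision decimals may lose precision as doubles):

    from pyspark.sql.functions import col
    from pyspark.sql.types import DecimalType

    # Cast decimal columns to double so pandas receives float64 instead of object
    decimal_cols = {f.name for f in spark_df.schema.fields
                    if isinstance(f.dataType, DecimalType)}
    spark_df = spark_df.select(
        [col(c).cast("double") if c in decimal_cols else col(c)
         for c in spark_df.columns])

    pandas_df = spark_df.toPandas()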

Converting a Spark DataFrame to pandas can take time if the DataFrame is large, so you can use something like the following:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pd_df = df_spark.toPandas()

I have tried this in Databricks.

– Shikha (answer edited by Jaimil Patel)
  • The `spark.sql.execution.arrow.enabled` option is highly recommended, especially with `pyspark.pandas` in the upcoming Spark 3.2 release. – RndmSymbl Oct 14 '21 at 12:13
  • The SQL config `spark.sql.execution.arrow.enabled` has been deprecated in Spark v3.0 and may be removed in the future. Use `spark.sql.execution.arrow.pyspark.enabled` instead. – Gangadhar Kadam Mar 06 '22 at 04:01 (see the sketch below)
  • Can you please explain why it is more efficient? – notilas Oct 28 '22 at 03:36
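
Following the deprecation note above, here is the same idea under the newer flag name; a minimal sketch, assuming Spark 3.0+ with PyArrow installed and an active `spark` session (`df_spark` stands in for your DataFrame). Arrow helps because the data leaves the JVM as columnar batches instead of being serialized row by row, which is what makes the conversion faster.

    # Spark 3.0+ name for the Arrow-based transfer flag
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    pd_df = df_spark.toPandas()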