I've got a pandas dataframe called data_clean. It has a single transcript column, and the index holds names (ali, anthony, bill, etc.).
I want to convert it to a Spark dataframe, so I use the createDataFrame() method:
sparkDF = spark.createDataFrame(data_clean)
However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc.) from the original dataframe. The output of
sparkDF.printSchema()
sparkDF.show()
is:
root
|-- transcript: string (nullable = true)
+--------------------+
| transcript|
+--------------------+
|ladies and gentle...|
|thank you thank y...|
| all right thank ...|
| |
|this is dave he t...|
| |
| ladies and gen...|
| ladies and gen...|
|armed with boyish...|
|introfade the mus...|
|wow hey thank you...|
|hello hello how y...|
+--------------------+
The docs say createDataFrame() can take a pandas.DataFrame as an input. I'm using Spark version 3.0.1.
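For completeness, here's a self-contained reproduction sketch of what I'm seeing. The toy data and the local SparkSession setup are just stand-ins, not my actual data_clean:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy stand-in for data_clean: a single transcript column, names in the index.
pdf = pd.DataFrame(
    {"transcript": ["ladies and gentlemen...", "thank you thank you..."]},
    index=["ali", "anthony"],
)

sparkDF = spark.createDataFrame(pdf)
sparkDF.printSchema()
# Only the transcript column survives; the ali/anthony index is gone:
# root
#  |-- transcript: string (nullable = true)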
Other questions on SO related to this don't mention the index column disappearing:
- This one about converting Pandas to PySpark doesn't cover it.
- Same with this one.
- And this one relates to data dropping during conversion, but is more about window functions.
I'm probably missing something obvious, but how do I keep the index column when I convert from a pandas dataframe to a PySpark dataframe?
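The workaround I'm leaning towards is promoting the index to a regular column with reset_index() before the conversion. Here's a sketch (the "name" column label is just something I picked, assuming the index is currently unnamed):

# Turn the pandas index into an ordinary column so Spark keeps it.
# reset_index() names the new column "index" when the index has no name,
# so I rename it to "name" (my own choice of label).
pdf = data_clean.reset_index().rename(columns={"index": "name"})

sparkDF = spark.createDataFrame(pdf)
sparkDF.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- transcript: string (nullable = true)

Is that the right approach here?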