I've got a pandas dataframe called data_clean. It looks like this: [image: screenshot of data_clean]

I want to convert it to a Spark dataframe, so I use the createDataFrame() method:

sparkDF = spark.createDataFrame(data_clean)

However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc) from the original dataframe. The output of

sparkDF.printSchema()
sparkDF.show()

is

root
 |-- transcript: string (nullable = true)

+--------------------+
|          transcript|
+--------------------+
|ladies and gentle...|
|thank you thank y...|
| all right thank ...|
|                    |
|this is dave he t...|
|                    |
|   ladies and gen...|
|   ladies and gen...|
|armed with boyish...|
|introfade the mus...|
|wow hey thank you...|
|hello hello how y...|
+--------------------+

The docs say createDataFrame() can take a pandas.DataFrame as an input. I'm using Spark version '3.0.1'.

Other questions on SO related to this don't mention this problem of the index column disappearing.

I'm probably missing something obvious, but how do I get to keep the index column when I convert from a pandas dataframe to a PySpark dataframe?

Yann Stoneman

2 Answers

Call the pandas dataframe's reset_index method before converting it to a Spark dataframe; this moves the index into an ordinary column. You can also use rename_axis first to give that column a name.

sparkDF = spark.createDataFrame(data_clean.rename_axis('name').reset_index())
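To see what the pandas side of that one-liner does before Spark is involved, here is a minimal sketch. The sample rows are a hypothetical stand-in for data_clean (the real data is not shown in the question), and the variable name flat is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data_clean: names as the index, one text column.
data_clean = pd.DataFrame(
    {"transcript": ["ladies and gentlemen", "thank you thank you", "all right thank you"]},
    index=["ali", "anthony", "bill"],
)

# Name the index, then move it into an ordinary column.
flat = data_clean.rename_axis("name").reset_index()

print(list(flat.columns))     # ['name', 'transcript']
print(flat["name"].tolist())  # ['ali', 'anthony', 'bill']

# flat is what gets passed on, i.e. spark.createDataFrame(flat)
```

Because the names now live in a regular column, they should show up in the Spark schema alongside transcript.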
AdibP

A Spark DataFrame has no concept of an index, so if you want to preserve it, you have to turn it into a column first by calling reset_index on the pandas dataframe.

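You can also pass inplace=True to reset the index on the existing dataframe instead of assigning the result to a new variable (this may not actually reduce memory use, since pandas can still copy the data internally):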

df.reset_index(drop=False, inplace=True)

sparkDF = spark.createDataFrame(df)
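One thing worth knowing about the in-place variant: when the index has no name, pandas calls the new column "index". A small sketch, using a hypothetical two-row df in place of the real data, shows this and one way to rename the column afterwards:

```python
import pandas as pd

# Hypothetical two-row frame with an unnamed index, standing in for the real data.
df = pd.DataFrame({"transcript": ["a", "b"]}, index=["ali", "anthony"])

# Move the index into a column, modifying df itself.
df.reset_index(drop=False, inplace=True)
print(list(df.columns))  # ['index', 'transcript']

# Optional: give the new column a more descriptive name.
df.rename(columns={"index": "name"}, inplace=True)
print(list(df.columns))  # ['name', 'transcript']
```

The rename step is optional; it just avoids ending up with a Spark column literally called index.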
Vaebhav
    Both AdibP and Vaebhav's answers worked. I "accepted" the latter because of the explanation that "Spark DataFrame has no concept of an index", which made it really clear to me. Both approaches worked, and I liked the succinctness aspect and the rename_axis() feature of AdibP's answer. – Yann Stoneman Aug 01 '21 at 12:36