I've got a pandas dataframe called data_clean. It has a single transcript column, and the index holds names (ali, anthony, bill, etc.).
I want to convert it to a Spark dataframe, so I use the createDataFrame() method:
sparkDF = spark.createDataFrame(data_clean)
However, that seems to drop the index column (the one that has the names ali, anthony, bill, etc.) from the original dataframe. The output of
sparkDF.printSchema()
sparkDF.show()
is:
root
|-- transcript: string (nullable = true)
+--------------------+
| transcript|
+--------------------+
|ladies and gentle...|
|thank you thank y...|
| all right thank ...|
| |
|this is dave he t...|
| |
| ladies and gen...|
| ladies and gen...|
|armed with boyish...|
|introfade the mus...|
|wow hey thank you...|
|hello hello how y...|
+--------------------+
The docs say createDataFrame() can take a pandas.DataFrame as an input. I'm using Spark version 3.0.1.
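For completeness, here's a self-contained reproduction sketch of what I'm seeing. The toy data and the local SparkSession setup are just stand-ins, not my actual data_clean:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy stand-in for data_clean: a single transcript column, names in the index.
pdf = pd.DataFrame(
    {"transcript": ["ladies and gentlemen...", "thank you thank you..."]},
    index=["ali", "anthony"],
)

sparkDF = spark.createDataFrame(pdf)
sparkDF.printSchema()
# Only the transcript column survives; the ali/anthony index is gone:
# root
#  |-- transcript: string (nullable = true)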
Other questions on SO related to this don't mention the index column disappearing:
- This one about converting Pandas to PySpark doesn't cover it.
- Same with this one.
- And this one relates to data dropping during conversion, but is more about window functions.
I'm probably missing something obvious, but how do I keep the index column when I convert from a pandas dataframe to a PySpark dataframe?
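The workaround I'm leaning towards is promoting the index to a regular column with reset_index() before the conversion. Here's a sketch (the "name" column label is just something I picked, assuming the index is currently unnamed):

# Turn the pandas index into an ordinary column so Spark keeps it.
# reset_index() names the new column "index" when the index has no name,
# so I rename it to "name" (my own choice of label).
pdf = data_clean.reset_index().rename(columns={"index": "name"})

sparkDF = spark.createDataFrame(pdf)
sparkDF.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- transcript: string (nullable = true)

Is that the right approach here?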