I have run into an issue where I need a pandas df created from a spark df to handle umlauted characters correctly.
This is a minimal reproducible example:
from pyspark.sql.types import StructType, StructField, StringType

data = [("Citroën",)]
schema = StructType([
    StructField("car", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
The spark df looks like this:

+-------+
|    car|
+-------+
|Citroën|
+-------+
I want to convert the spark df into a pandas df, which I do via df.toPandas(). These are the outputs I get (shown as comments below each print):
pdf = df.toPandas()
print(pdf)
# 0 Citro??n
print(pdf["car"].unique())
# [u'Citro\xc3\xabn']
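If I read that repr right, \xc3\xab is the UTF-8 byte sequence for ë (0xC3 0xAB) that has been decoded as Latin-1, i.e. classic mojibake. If that is what is happening, reversing the wrong decode should recover the text. A sketch of that workaround, assuming Python 2 and that the column really holds those code points:

pdf = df.toPandas()
# Undo the wrong decode: back to the raw UTF-8 bytes, then decode properly
fixed = pdf["car"].str.encode("latin-1").str.decode("utf-8")
print(fixed.unique())  # [u'Citro\xebn'], i.e. Citroën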
Question: How do I get pandas to understand these special characters in the first place, rather than patching them up after the fact?
I have browsed forums and SO itself, but cannot find anything that works for me. I have tried setting PYTHONIOENCODING=utf8 as suggested by this, and I have also tried adding # -*- coding: UTF-8 -*- to the top of the .py file.
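A quick way to check whether those settings are actually being picked up by the driver process (a sketch; both names are plain Python, nothing PySpark-specific):

import os
import sys

# If PYTHONIOENCODING was exported before launching, both should report utf8
print(os.environ.get("PYTHONIOENCODING"))
print(sys.stdout.encoding)  # the codec Python uses for print()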
UPDATE 1
Converting the pandas df back to spark:
test_sdf = spark.createDataFrame(pdf)
test_sdf.show()
+--------+
| car|
+--------+
|Citroën|
+--------+
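So the text survives the round trip, which suggests the underlying data is intact and the problem is on the pandas/printing side. A sketch of how the two sides can be compared directly, using repr() so the terminal encoding cannot interfere:

# Compare the same cell as Spark and pandas see it
print(repr(df.collect()[0]["car"]))  # Spark-side value
print(repr(pdf["car"][0]))           # pandas-side value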