
I have struck an issue wherein I want a pandas DataFrame created from a Spark DataFrame to correctly handle umlauted characters.

This is a minimal reproducible example:

from pyspark.sql.types import StructType, StructField, StringType

data = [("Citroën",)]
schema = StructType([
    StructField("car", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)

The Spark df looks like this:

+-------+
|    car|
+-------+
|Citroën|
+-------+

I want to convert the Spark df into a pandas df. I try this via df.toPandas(), and these are the outputs I get:

pdf = df.toPandas()
print(pdf)
print(pdf["car"].unique())

0  Citro??n
[u'Citro\xc3\xabn']

Question: How do I get Pandas to understand these special characters?

I have browsed forums and SO itself but cannot find anything that works for me. I have tried setting PYTHONIOENCODING=utf8 as suggested by this. I have also tried adding # -*- coding: UTF-8 -*- to the .py file.
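Since the environment is under suspicion, one quick diagnostic (a sketch I'm adding, not from the original post) is to print which interpreter and terminal encoding are actually in use:

```python
import sys

# Confirm which interpreter is actually running and what
# encoding print() uses for the terminal output stream.
print(sys.version_info.major)
print(sys.stdout.encoding)
```

As the comments below reveal, the major version printing as 2 rather than 3 turned out to be the root cause here.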

UPDATE 1

Converting the pandas df back to spark:

test_sdf = spark.createDataFrame(pdf)
test_sdf.show()
+--------+
|     car|
+--------+
| Citroën|
+--------+
    is it possible you're accidentally using Python 2? you should never see `u` prefix in Python 3 – ti7 May 26 '22 at 20:31
  • 1
    I am using python 2. I can confirm I face the same issue with python 3 as well – Ayush Goyal May 26 '22 at 20:46
  • I cannot reproduce the problem: moving to and from pandas gives correct results (Python 3, Spark 3.2.1). – ZygD May 27 '22 at 08:27
  • @ZygD yes, I did not check the Python version. Somehow I was running Python 2 even when I thought it was 3. Python 3 gives correct results; Python 2 does not. Please read my last comment on the answer below – Ayush Goyal May 27 '22 at 11:45

1 Answer


I think the encoding should be fine. To check, you could try a word containing only regular letters.

But I think the problem is the data structure itself. Try moving the comma so that data contains a list of one tuple. The parentheses by themselves won't make a tuple; adding the trailing comma inside them will force the element into a tuple in the list.

data = [("Citroën",)]
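As a standalone illustration of why that trailing comma matters (my sketch, not part of the original answer):

```python
# Without the trailing comma the parentheses are just grouping,
# so you get a plain string; with it you get a one-element tuple.
not_a_tuple = ("Citroën")
a_tuple = ("Citroën",)
print(type(not_a_tuple))  # <class 'str'>
print(type(a_tuple))      # <class 'tuple'>
```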

I don't have any issues with pandas understanding these characters - it may just be the way your system is displaying the output. You could test this by converting back to Spark and seeing if it looks the same as before.

Edit - showing pandas working... This works fine for me:

import pandas as pd
print(pd.DataFrame({'car':['Citroën']}))

You could try:

pdf["car"] = pdf["car"].str.decode('utf-8')
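To sketch what that decode does (my example, assuming Python 3 and a column that ended up holding raw UTF-8 bytes, as the `'Citro\xc3\xabn'` output above suggests):

```python
import pandas as pd

# Simulate the artifact: the column holds raw UTF-8 bytes
# (what Python 2's byte-string str produced) instead of text.
pdf = pd.DataFrame({"car": [b"Citro\xc3\xabn"]})
pdf["car"] = pdf["car"].str.decode("utf-8")
print(pdf["car"].unique())  # ['Citroën']
```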
  • Hey. So it turns out I am really dumb or tackling this at 2AM, but anyway, the way I was creating the Spark df is not correct. I have updated the question. Please have a look – Ayush Goyal May 26 '22 at 20:30
  • Does the pandas dataframe convert back to spark correctly? If it does, it suggests pandas is fine, it's just the printing that's off. – s_pike May 26 '22 at 20:41
  • No it doesn't. Updated the question – Ayush Goyal May 26 '22 at 20:46
  • "I don't have any issues with pandas understanding these characters". Could you please elaborate...? – Ayush Goyal May 26 '22 at 20:52
  • 1
    Just to add, I can move utf-8 string data between spark and pandas fine with no issues on python 3, so it must be something to do with your local set up/versions. – s_pike May 26 '22 at 21:39
  • turns out this was a python version issue. With Python 2.7 you get 'Citro\xc3\xabn' if you convert spark df to pandas df. Python 3.5 works fine. If you can edit your answer to reflect the same, I'll accept this. Thanks for the help! – Ayush Goyal May 27 '22 at 10:35