
I am loading a dataframe containing non-ASCII characters (åäö) into Spark using spark.read.csv with encoding='utf-8', and then trying a simple show().

>>> df.show()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
    print(self._jdf.showString(n, truncate))
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 579: ordinal not in range(128)

I figure this is probably related to Python itself, but I cannot understand how any of the commonly suggested encoding tricks can be applied in the context of PySpark and the show() function.
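
For reference, a minimal sketch of roughly what I am doing (the path is a placeholder, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder path; the real file has one string column containing åäö
df = spark.read.csv("/data/sample.csv", header=True, encoding="utf-8")
df.show()  # fails with UnicodeEncodeError when a row contains non-ASCII text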

– salient
  • Do you experience this only when using `show`? – zero323 Sep 23 '16 at 14:13
  • @zero323 are there any other print-related commands that I could try? – salient Sep 23 '16 at 16:18
  • For starters you can try if `df.rdd.map(lambda x: x).count()` succeeds. – zero323 Sep 23 '16 at 16:22
  • @zero323 – Yes, I have even successfully run some Spark SQL-queries — it's only this show()-function that fails on the encoding of the characters in strings. – salient Sep 23 '16 at 16:26
  • So `rdd.take(20)` for example executes without a problem? If so the problem may be a header. One way or another can you provide a minimal data sample which can be used to reproduce the problem? – zero323 Sep 23 '16 at 16:39
  • @zero323 `rdd.take(20)` executes just fine without any issues (however characters such as åäö are in strange Unicode-fashion `\uxxx`). I have kind of isolated that it cannot be a problem with the header as the only column that contains åäö is the one I cannot do show on (verified by iteratively doing `df.select('column_name').show()`) – salient Sep 27 '16 at 15:36
  • @salient I am facing the exact same problem with show(). Were you able to figure out a solution/fix for this? Thanks! – activelearner Mar 07 '18 at 00:53
  • @activelearner Tbh, haven't used Spark since I asked this question, but my guess is that a significant amount of the encoding pain would have gone away had I used python 3. What version are you on? – salient Mar 07 '18 at 00:56

4 Answers


https://issues.apache.org/jira/browse/SPARK-11772 describes this issue and gives a solution: run

export PYTHONIOENCODING=utf8

before starting pyspark. I wonder why the above works, because sys.getdefaultencoding() returned utf-8 for me even without it.

The question "How to set sys.stdout encoding in Python 3?" also covers this and gives the following solution for Python 3:

import sys
# rewrap stdout as a line-buffered UTF-8 text stream
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
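
One likely explanation: print writes through sys.stdout, whose encoding comes from the locale or from PYTHONIOENCODING, not from sys.getdefaultencoding(). A quick sketch to compare the two from inside the pyspark shell:

import sys

print(sys.getdefaultencoding())  # implicit str/unicode conversion encoding (Python 2)
print(sys.stdout.encoding)       # encoding used by print(); this is what PYTHONIOENCODING overrides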
– Jussi Kujala
# Python 2 only: reload(sys) restores setdefaultencoding(), which site.py removes at startup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

This works for me; I set the encoding up front and it stays in effect throughout the script.

– swapnil shashank
  • check this: https://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script – Kardi Teknomo Nov 22 '21 at 02:31

I faced the same issue with the following versions of Spark and Python:

Spark - 2.4.0

Python - 2.7.5

None of the above solutions worked for me.

For me, the issue occurred while saving the result RDD to an HDFS location: I was reading the input from HDFS and writing the result back to HDFS. This was the code used for the read and write operations when the issue came up:

Reading input data:

monthly_input = sc.textFile(monthly_input_location).map(lambda i: i.split("\x01"))
monthly_input_df = sqlContext.createDataFrame(monthly_input, monthly_input_schema)

Writing to HDFS:

# str(i) implicitly encodes unicode values to ASCII under Python 2, which is the
# likely source of the UnicodeEncodeError for non-ASCII values
result = output_df.rdd.map(tuple).map(lambda line: "\x01".join([str(i) for i in line]))
result.saveAsTextFile(output_location)

I changed the reading and writing code to the following:

Reading code:

monthly_input = sqlContext.read.format("csv") \
    .option("encoding", "UTF-8") \
    .option("header", "true") \
    .option("delimiter", "\x01") \
    .schema(monthly_input_schema) \
    .load(monthly_input_location)

Writing Code:

output_df.write.format("csv") \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .save(output_location)

Not only did this solve the issue, it also improved I/O performance considerably (almost 3x).

There is one known issue with the write logic above for which I have yet to find a proper solution: if a field in the output is blank, the CSV writer encloses the blank value in double quotes ("").

For me that is currently not a big deal, since I load the output into Hive anyway, and the double quotes can be stripped during the import.
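
If the quoted blanks ever become a problem, one option that may be worth trying (an untested sketch; the emptyValue option exists for the CSV source in Spark 2.4, but check the documentation for your version) is to set it explicitly:

# write empty strings as truly empty fields instead of quoted ""
output_df.write.format("csv") \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .option("emptyValue", "") \
    .save(output_location)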

PS: I am still using SQLContext and have yet to upgrade to SparkSession, but from what I have tried so far, the equivalent read and write operations work the same way in SparkSession-based code.
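
For reference, a minimal sketch of what the equivalent read and write might look like with a SparkSession (assuming Spark 2.x; the variable names are the same as above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# same CSV read as above, but via the SparkSession entry point
monthly_input = spark.read.format("csv") \
    .option("encoding", "UTF-8") \
    .option("header", "true") \
    .option("delimiter", "\x01") \
    .schema(monthly_input_schema) \
    .load(monthly_input_location)

output_df.write.format("csv") \
    .option("header", "false") \
    .option("delimiter", "\x01") \
    .save(output_location)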

– Harikrishnan

One or more of your columns may contain accented words or other characters from the extended ASCII table.
If you don't mind stripping those accents and replacing, e.g., "ó" with "o", unicodedata should work fine.

Python 2.7 Solution

>>> import unicodedata
>>> s = u'foóòÒöõþ'
>>> s
u'fo\xf3\xf2\xd2\xf6\xf5\xfe'
>>> unicodedata.normalize('NFD', s).encode('ASCII', 'ignore')
'foooOoo'

Pyspark 1.6.0 Solution

from pyspark.sql.functions import col, udf
import unicodedata

# decompose accented characters (NFD), then drop the non-ASCII combining marks
fix_ascii = udf(
    lambda str_: unicodedata.normalize('NFD', str_).encode('ASCII', 'ignore')
)

df = df.withColumn("column", fix_ascii(col("column")))

# udf will perform the operation defined in each one of the column rows
# you'll get something like this:
# +-----+------------+          +-----+------------+
# |col_A|col_B       |          |col_A|col_B       |
# +-----+------------+          +-----+------------+
# |1    |Tédy example|   -->    |1    |Tedy example|
# |2    |Adàm example|          |2    |Adam example|
# |3    |Tomþ example|          |3    |Tom example |
# +-----+------------+          +-----+------------+
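
Note that on Python 3 the .encode('ASCII', 'ignore') call returns bytes rather than str, so you would likely want to decode back to a string before returning it from the UDF (a sketch, assuming the same column setup as above):

# Python 3 variant: decode back to str so the UDF returns a string column
fix_ascii = udf(
    lambda str_: unicodedata.normalize('NFD', str_)
                            .encode('ASCII', 'ignore')
                            .decode('ASCII')
)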