
I am reading some files from Google Cloud Storage using Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('aggs').getOrCreate()

df = (spark.read
      .option("sep", "\t")
      .option("encoding", "UTF-8")
      .csv('gs://path/', inferSchema=True, header=True, encoding='utf-8'))
df.count()
df.show(10)

However, I keep getting an error that complains about the df.show(10) line:

df.show(10)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 162: ordinal not in range(128)

I googled and found that this seems to be a common error, and that the suggested fix is to add an encoding of "UTF-8" via spark.read.option, which I have already done. Since this doesn't help and I am still getting the error, could someone advise? Thanks in advance.

– Kevin

1 Answer


How about exporting PYTHONIOENCODING before running your Spark job:

export PYTHONIOENCODING=utf8
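
For example, assuming your driver script is named job.py (the name here is just a placeholder), the variable can also be set inline for a single run:

PYTHONIOENCODING=utf8 spark-submit job.py

The variable has to be in the environment before the Python interpreter starts, so setting it from inside the script itself is too late.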

For Python 3.7+ the following should also do the trick:

import sys
sys.stdout.reconfigure(encoding='utf-8')
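
Put together, a minimal sketch assuming the same df as in your question:

import sys

# Python 3.7+: switch stdout to UTF-8 before printing non-ASCII rows
sys.stdout.reconfigure(encoding='utf-8')

df.show(10)  # should no longer raise the UnicodeEncodeError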

For Python 2.x you can use the following:

import sys
reload(sys)  # restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf-8')

– Giorgos Myrianthous
  • This should work. `df.show()` uses the good old `print` function, which will try to encode the output data to whatever your system default is before piping it to stdout. Make sure you set this environment variable before starting the Python interpreter. – Håken Lid Jul 26 '19 at 14:53
  • @Giorgos Myrianthous Hi, I did the export and checked with sys.stdin.encoding, sys.stdout.encoding, and sys.stderr.encoding; they all output "utf-8". However, I am still getting the same error as before. Should I do something more? – Kevin Jul 26 '19 at 15:17
  • @Kevin How about `import sys` `reload(sys)` `sys.setdefaultencoding('utf-8')` ? – Giorgos Myrianthous Jul 26 '19 at 15:20