
I am reading some files from Google Cloud Storage using Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('aggs').getOrCreate()

df = (spark.read
      .option("sep", "\t")
      .option("encoding", "UTF-8")
      .csv('gs://path/', inferSchema=True, header=True, encoding='utf-8'))
df.count()
df.show(10)

However, I keep getting an error that complains about the df.show(10) line:

df.show(10)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 162: ordinal not in range(128)

I googled and found that this seems to be a common error, and that the suggested fix is to add an encoding of "UTF-8" via spark.read.option, which I have already done. Since this doesn't help and I am still getting the error, could someone advise? Thanks in advance.

– Kevin

1 Answer


How about exporting PYTHONIOENCODING before running your Spark job:

export PYTHONIOENCODING=utf8
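
For example, assuming your driver script is named job.py (the name here is just a placeholder), the variable can also be set inline for a single run:

PYTHONIOENCODING=utf8 spark-submit job.py

The variable has to be in the environment before the Python interpreter starts, so setting it from inside the script itself is too late.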

For Python 3.7+ the following should also do the trick:

import sys
sys.stdout.reconfigure(encoding='utf-8')
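
Put together, a minimal sketch assuming the same df as in your question:

import sys

# Python 3.7+: switch stdout to UTF-8 before printing non-ASCII rows
sys.stdout.reconfigure(encoding='utf-8')

df.show(10)  # should no longer raise the UnicodeEncodeError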

For Python 2.x you can use the following:

import sys
reload(sys)  # restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf-8')

– Giorgos Myrianthous
  • This should work. `df.show()` uses the good old `print` function, which will try to encode the output data to whatever your system default is before piping it to stdout. Make sure you set this environment variable before starting the Python interpreter. – Håken Lid Jul 26 '19 at 14:53
  • @Giorgos Myrianthous Hi, I did the export and checked with sys.stdin.encoding, sys.stdout.encoding, and sys.stderr.encoding; they all output "utf-8". However, I am still getting the same error as before. Should I do something more? – Kevin Jul 26 '19 at 15:17
  • @Kevin How about `import sys` `reload(sys)` `sys.setdefaultencoding('utf-8')` ? – Giorgos Myrianthous Jul 26 '19 at 15:20