Pyspark - UnicodeEncodeError: 'ascii' codec can't encode character

Question

Getting unicodeerror while running the below program while trying to insert the data into the Oracle DB.

# -*- coding: utf-8 -*-
#import unicodedata
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import sys
print(sys.getdefaultencoding())

u = 'abcdé'
a = 'Austròalia'
print(u)
print(a)

spark = SparkSession.builder.master("local") \
        .appName("Unicode_Error") \
        .getOrCreate()

sqlContext = SQLContext(spark)

l = [(340, 'India',1),(340, 'Canada',2),(341, u'abcdé',3),(340, 'Japan',4),(341, u'Austròalia',5),(341, 'China',6)]
df = sqlContext.createDataFrame(l, ['CUSTOMER_ID', 'COUNTRY', 'LINENUMBER'])
df.show()

data_tuples = [tuple(x) for x in df.rdd.collect()]

print(str(data_tuples))

print(type(data_tuples))

query = "INSERT INTO CUSTOMERS VALUES (:1, :2, :3)"
cur = con.cursor()
cur.prepare(query)
cur.executemany(None, data_tuples)
con.commit()
cur.close()
con.close()

Had set the PYTHONIOENCODING=utf8 before submitting the Spark job which solved the issues with the dataframe.show(). and also # -*- coding: utf-8 -*- helped with resolving the python print statements.

Though now I am getting an error even after the dataframe displays the data correctly. The conversion of the dataframe into list is where the issue tends to happen, could you please advise what else needs to be done.

ascii
abcdé
Austròalia
+-----------+----------+----------+
|CUSTOMER_ID|   COUNTRY|LINENUMBER|
+-----------+----------+----------+
|        340|     India|         1|
|        340|    Canada|         2|
|        341|     abcdé|         3|
|        340|     Japan|         4|
|        341|Austròalia|         5|
|        341|     China|         6|
+-----------+----------+----------+

[(340, u'India', 1), (340, u'Canada', 2), (341, u'abcd\xe9', 3), (340, u'Japan', 4), (341, u'Austr\xf2alia', 5), (341, u'China', 6)]
<type 'list'>

> Traceback (most recent call last): cur.executemany(None, data_tuples)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 4: ordinal not in range(128)

The tuple list has unicode data and the usage of encode was not possible on the same, but printing out each element in the tuple list gave me the exact output as below

[('340', "u'India'", '1'), ('340', "u'Canada'", '2'), ('341', "u'abcd\\xe9'", '3'), ('340', "u'Japan'", '4'), ('341', "u'Austr\\xf2alia'", '5'), ('341', "u'China'", '6')]
***********************
India
340
India
1
340
Canada
2
341
abcdé
3
340
Japan
4
341
Austròalia
5
341
China
6

Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) — 10465355, Nov 23 '18 at 10:56
@user10465355 had tried the `encode` but that doesnt seem to support tuples, can as you can see in the output, its the tuples list that has the unicode character. can you throw some light on how to achieve it, maybe i was doing it the wrong way — DataWrangler, Nov 23 '18 at 11:28
For python 2.7, the only way to fix would be to replace() occurrences of those strings. — pvy4917, Nov 23 '18 at 15:26
@karma4917 that would be a bit tedious right as we cannot assure the data that comes over, do you have any option handy if so can you share....? — DataWrangler, Nov 26 '18 at 15:05
@user10465355 can the question be removed from the duplicate tag, so that I can post over a solution that helped me solve the issue. — DataWrangler, Nov 26 '18 at 15:06
Pending close (if there are any, I don't have a privilege to see this) vote shouldn't interfere with your ability to post an answer at the moment. — 10465355, Nov 26 '18 at 17:15

score 1 · Accepted Answer · answered Nov 27 '18 at 07:15

This was resolved by passing additional params while connecting to Oracle via cx_Oracle.

Set the encoding method for the python environment to support the Unicode data handling

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Supply the encoding properties in the cx_Oracle connect

con = cx_Oracle.connect(connection_string, encoding = "UTF-8", nencoding = "UTF-8")

You can refer the https://github.com/oracle/python-cx_Oracle/issues/36 to get more idea on the same.

Pyspark - UnicodeEncodeError: 'ascii' codec can't encode character

1 Answers1