Getting unicodeerror while running the below program while trying to insert the data into the Oracle DB.
# -*- coding: utf-8 -*-
#import unicodedata
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import sys
print(sys.getdefaultencoding())
u = 'abcdé'
a = 'Austròalia'
print(u)
print(a)
spark = SparkSession.builder.master("local") \
.appName("Unicode_Error") \
.getOrCreate()
sqlContext = SQLContext(spark)
l = [(340, 'India',1),(340, 'Canada',2),(341, u'abcdé',3),(340, 'Japan',4),(341, u'Austròalia',5),(341, 'China',6)]
df = sqlContext.createDataFrame(l, ['CUSTOMER_ID', 'COUNTRY', 'LINENUMBER'])
df.show()
data_tuples = [tuple(x) for x in df.rdd.collect()]
print(str(data_tuples))
print(type(data_tuples))
query = "INSERT INTO CUSTOMERS VALUES (:1, :2, :3)"
cur = con.cursor()
cur.prepare(query)
cur.executemany(None, data_tuples)
con.commit()
cur.close()
con.close()
Had set the PYTHONIOENCODING=utf8
before submitting the Spark job which solved the issues with the dataframe.show(). and also # -*- coding: utf-8 -*-
helped with resolving the python print statements.
Though now I am getting an error even after the dataframe displays the data correctly. The conversion of the dataframe into list is where the issue tends to happen, could you please advise what else needs to be done.
ascii
abcdé
Austròalia
+-----------+----------+----------+
|CUSTOMER_ID| COUNTRY|LINENUMBER|
+-----------+----------+----------+
| 340| India| 1|
| 340| Canada| 2|
| 341| abcdé| 3|
| 340| Japan| 4|
| 341|Austròalia| 5|
| 341| China| 6|
+-----------+----------+----------+
[(340, u'India', 1), (340, u'Canada', 2), (341, u'abcd\xe9', 3), (340, u'Japan', 4), (341, u'Austr\xf2alia', 5), (341, u'China', 6)]
<type 'list'>
> Traceback (most recent call last): cur.executemany(None, data_tuples)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
> position 4: ordinal not in range(128)
The tuple list has unicode data and the usage of encode
was not possible on the same, but printing out each element in the tuple list gave me the exact output as below
[('340', "u'India'", '1'), ('340', "u'Canada'", '2'), ('341', "u'abcd\\xe9'", '3'), ('340', "u'Japan'", '4'), ('341', "u'Austr\\xf2alia'", '5'), ('341', "u'China'", '6')]
***********************
India
340
India
1
340
Canada
2
341
abcdé
3
340
Japan
4
341
Austròalia
5
341
China
6