I am trying to read a text file with Chinese characters in PySpark, but the data is internally treated as unicode strings and saved/displayed as escaped unicode (\uXXXX) sequences. I want to save them as Chinese characters.
I am using a Jupyter notebook, Python 2.7, and Spark spark-1.6.0-bin-hadoop2.6.
Actual data (one record per line):
"广东省电白建筑工程总公司"|2015-08-05
"广东省阳江市建安集团有限公司"|2015-07-09
Code:
data = sc.textFile("/Users/msr/Desktop/newsData0210.txt")
data.take(1)
O/P: u'"\u5e7f\u4e1c\u7701\u7535\u767d\u5efa\u7b51\u5de5\u7a0b\u603b\u516c\u53f8"|2015-08-05'
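Note that this escaped form is just how Python 2.7's repr() displays a unicode string; a minimal check (assuming the same sc and file path as above) suggests the underlying characters are intact when printed:

# repr() of a Python 2 unicode string escapes non-ASCII code points,
# but print() renders them on a UTF-8-capable terminal.
data = sc.textFile("/Users/msr/Desktop/newsData0210.txt")
for line in data.take(1):
    print(line)   # expected: "广东省电白建筑工程总公司"|2015-08-05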
Please suggest if there is any way to avoid this automatic conversion.
Edit: @Alberto Bonsanto .. my terminal can display unicode characters. Spark is internally converting the Chinese strings to unicode strings. Actually, I need to classify the data, and this automatic conversion is causing the problem. Is there any way to stop it?
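One workaround sketch, if the goal is a saved file containing raw Chinese characters rather than escapes: encode each record back to UTF-8 bytes before writing (the output directory below is a hypothetical example path):

# Encode each unicode record to UTF-8 bytes, then save; Spark writes
# the bytes as-is, so the file contains the Chinese characters.
encoded = data.map(lambda line: line.encode("utf-8"))
encoded.saveAsTextFile("/Users/msr/Desktop/newsData_utf8")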
Resolved: the problem got resolved when we updated Python from 2.7 to 3.4. I am not sure why it was failing for Python 2.7; I had tried the options mentioned in the other reference posts given in this thread.
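For context, in Python 3 every str is a unicode string and repr() renders printable non-ASCII characters directly, so the same code (assuming the same sc and path) displays the Chinese text as-is:

# Under Python 3, take(1) displays the characters without escaping.
data = sc.textFile("/Users/msr/Desktop/newsData0210.txt")
data.take(1)
# expected: ['"广东省电白建筑工程总公司"|2015-08-05']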