
I am trying to read a text file containing Chinese characters in PySpark, but the data is internally treated as Unicode strings and is saved/displayed as Unicode escape sequences. I want to save it as Chinese characters.

I am using a Jupyter notebook with Python 2.7 and Spark spark-1.6.0-bin-hadoop2.6.

Actual data:

    "广东省电白建筑工程总公司"|2015-08-05
    "广东省阳江市建安集团有限公司"|2015-07-09

Code:
    data = sc.textFile("/Users/msr/Desktop/newsData0210.txt")
    data.take(1)

Output:

    u'"\u5e7f\u4e1c\u7701\u7535\u767d\u5efa\u7b51\u5de5\u7a0b\u603b\u516c\u53f8"|2015-08-05'

Please suggest if there is any way to avoid this automatic conversion.
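For reference, the `u'\u...'` escapes are just Python 2's `repr` of the string: `take(1)` returns a list, and echoing a list at the REPL shows each element's `repr`, which escapes non-ASCII characters. A minimal Python 2.7 sketch of the difference:

    # Python 2.7: the data itself is intact; only the repr shows escapes.
    s = u"\u5e7f\u4e1c\u7701"   # the same kind of string take(1) returns
    print repr(s)               # u'\u5e7f\u4e1c\u7701'
    print s                     # 广东省 (terminal permitting)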

Edit: @Alberto Bonsanto: my terminal can display Unicode. Spark is internally converting the Chinese strings to Unicode strings. I actually need to classify the data, and this automatic conversion is causing the problem. Is there any way to stop it?
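If the goal is to write the records back out as readable Chinese text, here is a minimal sketch (reusing the `data` RDD above; the output path is hypothetical) that encodes each line as UTF-8 into a local file:

    import codecs

    # Hypothetical output path; collect() assumes the data fits in driver memory.
    with codecs.open("/Users/msr/Desktop/newsData_out.txt", "w",
                     encoding="utf-8") as f:
        for line in data.collect():
            f.write(line + u"\n")  # unicode lines are encoded to UTF-8 bytes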

Resolved: the problem went away when we updated Python from 2.7 to 3.4. I am not sure why it was failing under Python 2.7; I had already tried the options mentioned in the other posts referenced in this thread.
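The Python version likely matters because `repr` changed between versions: Python 3's `repr` of a `str` shows printable non-ASCII characters directly, while Python 2's `repr` of a `unicode` string always escapes them. A quick illustration:

    # Python 3: repr of a str shows printable non-ASCII characters directly.
    s = "\u5e7f\u4e1c\u7701"
    print(repr(s))   # '广东省'
    print(ascii(s))  # '\u5e7f\u4e1c\u7701' (the old Python 2-style escapes)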

msr
  • If you're at a REPL (interactive prompt), that's just the `repr` of the Unicode string. Try `print`. – nneonneo Mar 02 '16 at 23:22
  • Thanks for the quick response... could you please give an example? Is it like `d2 = data.map(lambda line: repr(line))` then `d2.take(1)`? That is not working. – msr Mar 02 '16 at 23:31
  • Just do `print d2.take(1)`, assuming your terminal is set up for Chinese output. (Note that if you were to write `d2.take(1)` to a file, you'd get Chinese text in that file.) – nneonneo Mar 02 '16 at 23:32
  • Thanks again; unfortunately this is not working either. I am trying it on a Mac and also on Ubuntu. Could it have something to do with the Python version? My colleague runs Python 3.4 and Spark 1.6 on Windows and she is able to see the Chinese characters. – msr Mar 02 '16 at 23:36

0 Answers