0

I was trying to take the content of a text file and map it into a JSON file, but I noticed that Python automatically turned the Kurdish (Sorani) text into escaped UTF-8 byte literals. Can someone explain why Python does this and how I can prevent the conversion?

You can test it with the code below:

def readText():
    # test.txt contains kurdish sorani characters (an article)
    # Sorani example: ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە. 
    with open('test.txt', 'r') as context:
        data = context.readlines()
        return data
print(readText())

I'm running Python 2.x on Ubuntu 14.x. Only Python 2.x does this; Python 3.x does not convert the text and works just fine.

Ranj

2 Answers

0

You are seeing the repr output because readlines returns a list, and printing a list shows the repr of each element. In Python 2, which you are using, a str is a byte string, so the repr shows the UTF-8 bytes of the text as \x escapes. Once you print the strings themselves, you will see the actual str output:

In [11]: out = readText()

In [12]: print out
['\xda\x95\xdb\x86\xda\x98\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xa8\xd8\xa7\xd8\xb4 \xd8\xa8\xdb\x95\xda\x95\xdb\x8e\xd8\xb2\xd8\xa7\xd9\x86. \xd9\x85\xd9\x86 \xd9\x86\xd8\xa7\xd9\x88\xd9\x85 \xda\x95\xdb\x95\xd9\x86\xd8\xac\xdb\x95. ']

In [13]: print out[0]
ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە. 
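
The same effect can be reproduced without a file. This is only a minimal sketch: the variable names are made up for illustration, and it assumes a terminal that can display UTF-8. It encodes a short Sorani string to the bytes a UTF-8 file would hold, then shows that the repr (what a list prints for its elements) is escaped while the decoded text is not.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Hypothetical sample text; any non-ASCII string behaves the same way.
text = "ڕۆژتان باش"

# These are the bytes a UTF-8 encoded file would contain for that text.
raw = text.encode("utf-8")

# repr() of a byte string escapes every non-ASCII byte as \xNN.
# Printing a list calls repr() on each element, hence the escaped output.
escaped = repr(raw)
print(escaped)
print([raw])

# Decoding the bytes gives back the readable text.
decoded = raw.decode("utf-8")
print(decoded)
```

Nothing is converted or lost here: the escapes are just how the bytes are displayed.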
Padraic Cunningham
  • Yes this is only true when I give it an index. This does not work when I read an article from text file A and write it into text file or json file B – Ranj Jan 01 '16 at 21:25
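
For the file-to-JSON case raised in the comment above, one possible approach is to decode the input on read and ask json not to escape non-ASCII characters. The sketch below is written with Python 3 in mind; `text_to_json`, the `"lines"` key, and the file names are hypothetical. `io.open` also exists on Python 2.6+, but `json.dumps` can return a byte string there, so the code may need adjusting on Python 2.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile


def text_to_json(src_path, dst_path):
    """Read a UTF-8 text file and write its lines into a JSON file."""
    # Passing encoding= makes read() return decoded text, not raw bytes.
    with io.open(src_path, "r", encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src]
    # ensure_ascii=False keeps characters as-is instead of \uXXXX escapes.
    payload = json.dumps({"lines": lines}, ensure_ascii=False)
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(payload)


# Small demonstration using a temporary directory (paths are illustrative).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "test.txt")
dst_path = os.path.join(tmp_dir, "out.json")
sample = "ڕۆژتان باش بەڕێزان."

with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(sample + "\n")

text_to_json(src_path, dst_path)

with io.open(dst_path, "r", encoding="utf-8") as f:
    written = f.read()
```

Opening out.json in an editor should then show the Sorani text itself. Without `ensure_ascii=False` the file would contain `\uXXXX` escapes, which are still valid JSON and load back to the same text.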
-1

I'm going to take a stab here and guess that you are reading the output in a terminal of some sort, and when Python writes to the terminal it's trying to display in ASCII.

If you set your PYTHONIOENCODING environment variable to UTF-8 this can sometimes solve the issue - it depends on other variables as well.

So, if you're on a UNIX-like system, try this in your terminal: export PYTHONIOENCODING=UTF-8

Or, for Windows, set PYTHONIOENCODING=UTF-8.

Then, try running your script again and see if you get the correct characters printed.
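
To check whether the variable was picked up, one option is to look at what Python reports for its standard output. A small sketch, where the helper name `describe_stdout_encoding` is made up for illustration:

```python
import os
import sys


def describe_stdout_encoding():
    """Return the PYTHONIOENCODING setting and the encoding stdout uses."""
    return {
        # None when the environment variable is not set.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        # May be None when stdout is not a regular terminal stream.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


info = describe_stdout_encoding()
print(info)
```

If `stdout_encoding` reports an ASCII-only encoding, printing non-ASCII text may fail or be garbled, which is the situation the environment variable is meant to address.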

More information can be found here: How to print UTF-8 Encoded Text to the console in Python3

Kyle Pittman
  • Care to explain downvotes? I don't see why that's really necessary. – Kyle Pittman Jan 02 '16 at 03:52
  • I didn't downvote you, but here are two guesses: 1) Your answer doesn't solve the question. 2) By your own admission, it's a total guess. We ought to understand issues before offering solutions – Alastair McCormack Jan 03 '16 at 14:25
  • @AlastairMcCormack Thanks for the constructive feedback, I'll keep my answer here as it may help point someone in the right direction in the future. – Kyle Pittman Jan 03 '16 at 18:02