Why is Python converting Kurdish characters into UTF-8 literals?

Question

I was trying to take the content of a text file and map it into a JSON file, but I noticed that Python automatically turned the Kurdish (Sorani) text into UTF-8 literals. Can someone explain why Python does this and how I can prevent the conversion?

You can test it with the code below:

def readText():
    # test.txt contains kurdish sorani characters (an article)
    # Sorani example: ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە. 
    with open('test.txt', 'r') as context:
        data = context.readlines()
        return data
print(readText())

I'm running Python 2.x on Ubuntu 14.x. Python 2.x does this; Python 3.x does not convert the text and works just fine.

You might solve your problem by checking out `PYTHONIOENCODING` environment variable: https://docs.python.org/2/using/cmdline.html — Kyle Pittman, Jan 01 '16 at 21:12
Where is the output, in a terminal? What version of Python is it being run with? Also, see my answer for a possibility. — Kyle Pittman, Jan 01 '16 at 21:20
@user3419211, are you sure you are using python3? You should not be seeing the repr representation with Python 3.4. — Padraic Cunningham, Jan 01 '16 at 21:32
Are you on Windows? Maybe you don't have the right code page. What is `sys.stdout.encoding`? Python will encode to that value when printing. — tdelaney, Jan 01 '16 at 21:46

Padraic Cunningham · Answer 1 · 2016-01-01T21:44:49.923

0

You are seeing the repr output because you call readlines, which returns a list, and printing a list shows the repr of each element. In Python 2 those elements are byte strings, so the repr shows the UTF-8 bytes as `\x..` escapes. Once you print the strings themselves you will see the actual str output:

In [11]: out = readText()

In [12]: print out
['\xda\x95\xdb\x86\xda\x98\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xa8\xd8\xa7\xd8\xb4 \xd8\xa8\xdb\x95\xda\x95\xdb\x8e\xd8\xb2\xd8\xa7\xd9\x86. \xd9\x85\xd9\x86 \xd9\x86\xd8\xa7\xd9\x88\xd9\x85 \xda\x95\xdb\x95\xd9\x86\xd8\xac\xdb\x95. ']

In [13]: print out[0]
ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە.
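As a minimal sketch of the same idea, assuming `test.txt` is UTF-8 encoded, the lines can be decoded while reading and then joined into a single text value before printing, so no list repr is involved. The helper names `read_lines` and `as_text` are hypothetical, not part of the original code.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io


def read_lines(path, encoding="utf-8"):
    """Return the file's lines as decoded text, without trailing newlines."""
    # io.open decodes to text on both Python 2 and Python 3; the built-in
    # open() in Python 2 returns raw byte strings instead.
    # Assumption: the file really is encoded as `encoding`.
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def as_text(lines):
    """Join the lines into one text value instead of printing the list."""
    return "\n".join(lines)
```

Printing `as_text(read_lines('test.txt'))`, or printing each element in a loop, shows the characters themselves. Printing the list directly still goes through repr, which in Python 2 renders non-ASCII text as escape sequences.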

edited Jan 01 '16 at 21:44

answered Jan 01 '16 at 21:19

Padraic Cunningham


Yes this is only true when I give it an index. This does not work when I read an article from text file A and write it into text file or json file B – Ranj Jan 01 '16 at 21:25
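For the file-to-JSON case raised in this comment, one possible approach is sketched below. It assumes the source file is UTF-8 and relies on `ensure_ascii=False`, which tells the `json` module to keep non-ASCII characters instead of writing `\uXXXX` escapes. The function name `text_file_to_json` and the `"lines"` key are hypothetical choices for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import json


def text_file_to_json(src_path, dst_path, encoding="utf-8"):
    """Copy the lines of a text file into a JSON file, keeping the characters readable."""
    # Decode while reading so the values are text, not raw bytes.
    with io.open(src_path, "r", encoding=encoding) as src:
        lines = [line.rstrip("\n") for line in src]

    # ensure_ascii=False keeps non-ASCII characters as they are.
    # The "lines" key is an assumption made for this sketch.
    payload = json.dumps({"lines": lines}, ensure_ascii=False)
    if isinstance(payload, bytes):
        # Python 2 may return a byte string when everything is ASCII.
        payload = payload.decode(encoding)

    # Encode explicitly when writing, so the result does not depend
    # on the platform's default encoding.
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(payload)
```

Without `ensure_ascii=False` the output is still valid JSON and loads back to the same text; the escapes only affect how the file looks in an editor.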

score -1 · Answer 2 · answered Jan 01 '16 at 21:19


I'm going to take a stab here and guess that you are reading the output in a terminal of some sort, and when Python writes to the terminal it's trying to display in ASCII.

If you set your PYTHONIOENCODING environment variable to UTF-8 this can sometimes solve the issue - it depends on other variables as well.

So, if you're on a UNIX-like system, try this in your terminal: export PYTHONIOENCODING=UTF-8

Or, for Windows, set PYTHONIOENCODING=UTF-8.

Then, try running your script again and see if you get the correct characters printed.
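To check whether the setting took effect, a small sketch like the following reports the encoding Python will use for stdout. `PYTHONIOENCODING` is read when the interpreter starts, so it has to be set before launching the script. The function name `output_encoding_report` is hypothetical.

```python
import os
import sys


def output_encoding_report():
    """Collect the values that decide how printed text is encoded."""
    return {
        # May be None when stdout is not attached to a terminal (common in Python 2).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # None means the environment variable is not set.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }
```

Printing the returned dictionary before and after the `export` shows whether the terminal encoding was the cause at all.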

More information can be found here: How to print UTF-8 Encoded Text to the console in Python3

answered Jan 01 '16 at 21:19

Kyle Pittman


Care to explain downvotes? I don't see why that's really necessary. – Kyle Pittman Jan 02 '16 at 03:52
I didn't downvote you, but here are two guesses: 1) Your answer doesn't solve the question. 2) By your own admission, it's a total guess. We ought to understand issues before offering solutions – Alastair McCormack Jan 03 '16 at 14:25
@AlastairMcCormack Thanks for the constructive feedback, I'll keep my answer here as it may help point someone in the right direction in the future. – Kyle Pittman Jan 03 '16 at 18:02
