Python print to terminal shell unicode

Question

I am parsing a long string of Persian text in Python, and am opening the file like this:

fp = codecs.open(f+i, 'r', encoding='utf-8').readlines()

and using

print(line[1])

but instead of printing out readable Persian, it outputs things like this in the terminal.

Ø§Ø·Ù
     Ø§Ø¹âØ±Ø³Ø§Ù

On the webpage, it outputs it fine.

What is the issue with it? Thank you

What does `import sys; sys.stdout.encoding` show? Is that correct for your console / terminal? — Martijn Pieters, May 01 '14 at 20:11
@MartijnPieters I've used that before for a previous python script and it worked, so I'm not sure why it's not working for me this time with the same terminal — user3325170, May 01 '14 at 20:14
@MartijnPieters well, specifically I did import sys and sys.stdout = codecs.getreader('utf-8')(sys.stdout) — user3325170, May 01 '14 at 20:16
@user3325170: oh, that was perhaps not the best idea? You do get UTF-8 output, but whatever is *reading* those UTF-8 bytes is printing them as Latin 1 instead. — Martijn Pieters, May 01 '14 at 20:20
@user3325170: are you looking at this text on a Windows machine perhaps? — Martijn Pieters, May 01 '14 at 20:22
@MartijnPieters Yes it's on a windows. Ok I will reply more to your comment below! — user3325170, May 01 '14 at 20:23
check [this](http://stackoverflow.com/q/39528462/5284370) out. — Soorena, Sep 18 '16 at 13:18

Martijn Pieters · Accepted Answer · 2014-05-01T20:39:03.747

You have a CP1252 Mojibake here. The first character is the code point U+0627 ARABIC LETTER ALEF, encoded to UTF-8, but then interpreted as CP1252:

>>> print u'\u0627'.encode('utf8').decode('cp1252')
Ø§
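
For readers on Python 3, the same round trip can be sketched as follows. The helper name `to_mojibake` is made up for this illustration; it only reproduces the misreading described above and is not something the asker's script contains.

```python
def to_mojibake(text):
    """Encode text as UTF-8, then misread each byte as a CP1252 character."""
    return text.encode("utf-8").decode("cp1252")

alef = "\u0627"                       # ARABIC LETTER ALEF
utf8_bytes = alef.encode("utf-8")     # two bytes: 0xD8 0xA7
garbled = to_mojibake(alef)           # 0xD8 -> U+00D8, 0xA7 -> U+00A7

print(utf8_bytes)                     # b'\xd8\xa7'
print(ascii(garbled))                 # '\xd8\xa7'
```

`ascii()` is used for the final print so the example does not itself depend on the terminal's encoding.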

Your SSH shell is misconfigured somewhere; the remote shell thinks you are using UTF-8, while locally the printed UTF-8 bytes are being printed as if they were CP1252 bytes.
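
As a first diagnostic on the remote side, something like the sketch below reports which encodings Python believes it is writing with. The function name `describe_output_encoding` is hypothetical. Note that this only shows the remote half of the picture; the character set used by the local terminal or SSH client has to be checked in that program's own settings.

```python
import locale
import sys

def describe_output_encoding():
    """Collect the encodings Python will use when printing text."""
    return {
        # Encoding attached to the stdout stream (may be None if stdout is replaced).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings of the process.
        "preferred": locale.getpreferredencoding(False),
    }

print(describe_output_encoding())
```

If both values say UTF-8 and the text still shows up as Ø§-style pairs, the bytes are most likely being decoded as CP1252 on the displaying side.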

What I can decipher is:

The Ù character is a Mojibake starting point for anything in the U+0640 to U+067F range; we cannot see the second byte for the two occurrences here. Ditto for the â character, which is the first byte of a three-byte UTF-8 sequence; the bytes that followed it weren't printable in CP1252, so they are again missing.
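
The span of code points hidden behind a lone Ù can be computed directly. This is a small Python 3 sketch; `code_points_for_lead_byte` is a name invented for the example, and it only handles two-byte UTF-8 sequences.

```python
def code_points_for_lead_byte(lead):
    """Return (first, last) code points whose 2-byte UTF-8 form starts with `lead`."""
    if not 0xC2 <= lead <= 0xDF:
        raise ValueError("not a 2-byte UTF-8 lead byte")
    # Continuation bytes run from 0x80 to 0xBF.
    first = bytes([lead, 0x80]).decode("utf-8")
    last = bytes([lead, 0xBF]).decode("utf-8")
    return ord(first), ord(last)

lead = "\u00d9".encode("cp1252")[0]    # the character Ù is byte 0xD9 in CP1252
first, last = code_points_for_lead_byte(lead)
print(hex(lead), "U+%04X" % first, "U+%04X" % last)   # 0xd9 U+0640 U+067F
```

Without the continuation byte there are 64 candidate characters for each Ù, which is why those positions cannot be recovered from the pasted output.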

Overall, what I can recover is:

>>> print u'Ø§Ø· - Ø§Ø¹ - Ø±Ø³Ø§'.encode('cp1252').decode('utf8')
اط - اع - رسا
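
In Python 3 the same recovery could look like the sketch below. The helper `fix_cp1252_mojibake` is hypothetical, and it assumes the damage is exactly "UTF-8 bytes read as CP1252"; text that does not fit that pattern is returned unchanged. Repairing strings this way is a stopgap, since the real fix is making the terminal and the remote shell agree on an encoding.

```python
def fix_cp1252_mojibake(text):
    """Reverse UTF-8-read-as-CP1252 damage; return the input unchanged if that fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not (only) CP1252 mojibake, or bytes are missing: leave it alone.
        return text

# The recoverable parts of the garbled output, written with escapes:
# Ø = U+00D8, § = U+00A7, · = U+00B7, ¹ = U+00B9, ± = U+00B1, ³ = U+00B3
garbled = ("\u00d8\u00a7\u00d8\u00b7 - \u00d8\u00a7\u00d8\u00b9 - "
           "\u00d8\u00b1\u00d8\u00b3\u00d8\u00a7")
recovered = fix_cp1252_mojibake(garbled)
print(ascii(recovered))
```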

Thank you, I would not have known this if you hadn't let me know. I will have to look more into this and get back to you about it. Perhaps tweak some things with my shell... will let you know! — user3325170, May 01 '14 at 20:24
@user3325170: it is CP1252, the Windows Latin-1 approximate codepage — Martijn Pieters, May 01 '14 at 20:31
