
I am parsing a long string of Persian text in Python, and am opening the file like this:

fp = codecs.open(f+i, 'r', encoding='utf-8').readlines()

and printing it with

print(line[1])

but instead of printing readable Persian, it outputs things like this in the terminal:

Ø§Ø·Ù
     Ø§Ø¹âØ±Ø³Ø§Ù

On the webpage, the same text displays fine.

What is the issue with it? Thank you

Martijn Pieters
user3325170

1 Answer

You have a CP1252 Mojibake here. The first character is the code point U+0627 ARABIC LETTER ALEF, encoded to UTF-8, but then interpreted as CP1252:

>>> print u'\u0627'.encode('utf8').decode('cp1252')
Ø§
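The snippet above uses Python 2 syntax. As a minimal sketch, assuming Python 3, the same round trip can be inspected byte by byte; the variable names are only illustrative, and the code points are printed instead of the characters so the result does not depend on the terminal's own encoding:

```python
# Python 3 sketch of the same round trip for U+0627 ARABIC LETTER ALEF.
alef = "\u0627"
utf8_bytes = alef.encode("utf-8")        # the two bytes 0xD8 0xA7
mojibake = utf8_bytes.decode("cp1252")   # each byte misread as one CP1252 character

print(utf8_bytes.hex())                          # d8a7
print([f"U+{ord(ch):04X}" for ch in mojibake])   # ['U+00D8', 'U+00A7']
```

U+00D8 and U+00A7 are the two characters that show up in the terminal in place of the single ALEF.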

Your SSH shell is misconfigured somewhere; the remote shell thinks you are using UTF-8, while locally the printed UTF-8 bytes are being printed as if they were CP1252 bytes.
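One way to see what the remote side believes is to ask Python directly. This is a rough sketch, assuming Python 3; `describe_encodings` is a hypothetical helper name, not something from the question's code:

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python assumes for this session (remote side only)."""
    return {
        # Encoding Python uses when writing text to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LANG, LC_ALL, ...).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

If these report UTF-8 and the output is still garbled, the mismatch is most likely on the local side: the terminal emulator has its own character-set setting, which Python running on the remote host cannot see.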

What I can decipher is:

The Ù character is a Mojibake starting point for anything in the U+0640 to U+067F range (it is the UTF-8 lead byte 0xD9); we cannot see the second byte for the two occurrences here. Ditto for the â character, which is the lead byte 0xE2 of a three-byte sequence; the bytes that followed it weren't printable in CP1252, so they are missing as well.
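The range a lone lead byte can stand for follows from the two-byte UTF-8 layout (`110xxxxx 10yyyyyy`). A small sketch, assuming Python 3, with `two_byte_range` as a hypothetical helper name:

```python
def two_byte_range(lead_byte):
    """Return (first, last) code points whose UTF-8 form starts with lead_byte.

    Only meaningful for two-byte lead bytes, 0xC2 through 0xDF.
    """
    if not 0xC2 <= lead_byte <= 0xDF:
        raise ValueError("not a two-byte UTF-8 lead byte")
    first = (lead_byte & 0x1F) << 6   # continuation byte 0x80 adds 0
    last = first | 0x3F               # continuation byte 0xBF adds 63
    return first, last


# U+00D9 is the Mojibake character; CP1252 maps it back to the byte 0xD9.
lead = "\u00d9".encode("cp1252")[0]
first, last = two_byte_range(lead)
print(f"U+{first:04X} to U+{last:04X}")   # U+0640 to U+067F
```

Without the continuation byte there is no way to tell which of those 64 code points was meant, which is why these characters cannot be recovered from the terminal output alone.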

Overall, what I can recover is:

>>> print u'Ø§Ø· - Ø§Ø¹ - Ø±Ø³Ø§'.encode('cp1252').decode('utf8')
اط - اع - رسا
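The same recovery can be wrapped in a function. This is a minimal sketch, assuming Python 3 and assuming every garbled character still exists in CP1252; `repair_mojibake` is a hypothetical name. Incomplete sequences, such as a lead byte whose continuation byte was lost, are replaced with U+FFFD instead of raising an error:

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-CP1252 Mojibake, marking unrecoverable bytes."""
    # Map each character back to the single CP1252 byte it was decoded from.
    # This raises UnicodeEncodeError if a character is not in CP1252.
    raw = text.encode("cp1252")
    # Decode the recovered bytes as UTF-8; truncated sequences become U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Escapes are used so the example does not depend on the source file encoding.
garbled = "\u00d8\u00a7\u00d8\u00b7 - \u00d8\u00a7\u00d8\u00b9 - \u00d8\u00b1\u00d8\u00b3\u00d8\u00a7"
fixed = repair_mojibake(garbled)
print([f"U+{ord(ch):04X}" for ch in fixed if ch not in " -"])
```

This only repairs text that was captured after the damage was done. The data read by the script is already correct UTF-8, so the real fix remains the terminal configuration.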
Martijn Pieters
  • Thank you, I would not have known this if you hadn't let me know. I will have to look more into this and get back to you about it. Perhaps tweak some things with my shell... will let you know! – user3325170 May 01 '14 at 20:24
  • @user3325170: it is CP1252, the Windows Latin-1 approximate codepage – Martijn Pieters May 01 '14 at 20:31