How to use UTF-8 in Python when the user inputs Persian or Arabic

Question

My code reads the names of some users and a score for each of them, and writes them into a text file as a table. For example:

19(someSpace*)|حمید

20(someSpace*)|وحید

70(someSpace*)|خلیل

14(someSpace*)|Hamid

def roww(STR, n):
    if n >= 10:
        return str(n) +4*" " +"| " + STR + "\n"
    else:
        return str(n) +5*" " +"|" + STR + "\n"

def my_table(STR, m):
    import sys  
    reload(sys)  
    sys.setdefaultencoding('utf8')
    import codecs as D
    f = D.open(STR + '.txt', "w", encoding='utf-8')
    i = 0
    while(i < m):
        i +=1
        a = raw_input("Name: ").encode('utf-8')
        b = raw_input("Grade: ")
        f.write(roww(a,b))
    f.close()

When I execute:

my_table("grade",3)
Name: حمید

I get this error:

    UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-a48ceb393d9a> in <module>()
----> 1 my_table("grade",3)

<ipython-input-9-6a83996822a3> in my_table(STR, m)
     14     while(i < m):
     15         i +=1
---> 16         a = raw_input("Name: ").encode('utf-8')
     17         b = raw_input("Grade")
     18         f.write(roww(a,b))

C:\Users\Hamid\Anaconda2\lib\encodings\utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xcd in position 0: invalid continuation byte

I can't solve this UTF-8 problem in Python, and I can't find any useful answer.

Have you put the line `# -*- coding: utf-8 -*-` at the very top of your file? — Arne, Nov 30 '17 at 08:23
Check up on this post, it should contain all the information you need: https://sopython.com/canon/88/how-do-i-use-non-ascii-strings-in-my-python-script/ — Arne, Nov 30 '17 at 08:25
Putting the encoding at the top of the file will not help you in this particular case. It only determines the encoding of the *source file itself*. Yours is a runtime problem. Due to how Python 2 "works", `raw_input("Name: ").encode('utf-8')` is actually equivalent to `raw_input("Name: ").decode('utf-8').encode('utf-8')` and it would seem that your input bytes do not contain UTF-8 encoded text. — Ilja Everilä, Nov 30 '17 at 08:28
This might be of use: https://stackoverflow.com/questions/477061/how-to-read-unicode-input-and-compare-unicode-strings-in-python. The trick is to try and determine the encoding used by your terminal, if using your application through such. That boils down to: you should always know your input's encoding. — Ilja Everilä, Nov 30 '17 at 08:37
I added `# -*- coding: utf-8 -*-` just before "when execute", but it doesn't help. I get the same error. — Hamid Shafie Asl, Nov 30 '17 at 08:41
@Ilja Everilä the link you gave was useful. But the new problem is that "?" symbols appear in the file instead of some letters. — Hamid Shafie Asl, Nov 30 '17 at 08:47
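
The implicit decode described in the comments above can be reproduced without a terminal. This is only a minimal sketch. It assumes the console delivered the name in Windows code page 1256, which the thread does not confirm; the assumption is suggested by byte 0xcd in the traceback, which is ح in cp1256. The variable names are illustrative, and the code runs on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Assumption: the console encodes input as cp1256 (not confirmed by the asker).
name = u"\u062d\u0645\u064a\u062f"   # an Arabic name, written with escapes
raw = name.encode("cp1256")          # stand-in for the bytes raw_input() returned

try:
    # The step Python 2 performs implicitly (with the default encoding
    # forced to utf8) before it can run .encode('utf-8') on a byte string.
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as exc:
    decoded_as_utf8 = False
    print(exc)

# Decoding with the encoding the bytes are actually in succeeds.
text = raw.decode("cp1256")
utf8_bytes = text.encode("utf-8")    # now safe to write to a UTF-8 file
```

The point of the sketch is that the bytes have to be decoded with the console's encoding first; only the resulting unicode string can be encoded as UTF-8.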

user1767754 · Answer 1 · 2017-11-30T16:59:41.520

1

Here is a simple example that demonstrates how to read, decode, and write Arabic text to a file. As Ilja already pointed out, it depends on your terminal: what you get from the terminal is a byte string that is already encoded (as UTF-8 in this case), so you have to decode it.

Works fine on macOS:

If you run this snippet and give it المدرالمدرالمدر as input, it will work fine.

# -*- coding: utf-8 -*-
x = raw_input("test: ").decode("utf-8")
print x
f = open("testarabic.txt", "w")
f.write(x.encode("utf-8"))
f.close()

The first line, `# -*- coding: utf-8 -*-`, is not needed in your case, unless you have some static non-ASCII text in your Python file.
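
As an alternative to encoding by hand before each write, the file can be opened with `io.open` and an explicit encoding, so that unicode strings are encoded on the way out. This is a hedged sketch, not the answer's method: `write_rows`, the row layout, and the temporary output path are hypothetical, and the names are assumed to be decoded to unicode already. It runs on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_rows(path, rows):
    """Write (name, grade) pairs as 'grade | name' lines, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        for name, grade in rows:
            # name must already be a unicode string, not raw console bytes.
            # The grade is left-aligned in a 6-character column.
            f.write(u"{0:<6}|{1}\n".format(grade, name))


rows = [(u"\u062d\u0645\u064a\u062f", 19), (u"Hamid", 14)]
path = os.path.join(tempfile.mkdtemp(), "grade.txt")  # hypothetical location
write_rows(path, rows)

# Read the file back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Keeping every string as unicode inside the program and encoding only at the file boundary avoids the implicit conversions that caused the original error.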

Workaround on Windows

I just tried it on Windows and, as you said, I also get ????? characters. I realized that the Windows command line uses a different encoding. So the first step, if your command line already shows Arabic characters correctly, is to find out which code page it is using by typing the following in the terminal:

1) Get the code page of your terminal

chcp
Active code page: 1256

2) Use the active code page reported by your terminal to decode the raw input

x = raw_input("test: ").decode("1256")

If your Windows command line wasn't showing Arabic characters correctly, you can set the code page by typing chcp 1256 in the Windows command line.
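
Instead of hard-coding the code page, the program can ask Python which encoding the console reports. This is a sketch under assumptions: `console_encoding` and `to_text` are hypothetical helper names, and `sys.stdin.encoding` is only a best guess that may not match what the console really sends. It runs on Python 2.7 and Python 3:

```python
import locale
import sys


def console_encoding():
    """Best guess at the encoding of bytes arriving from the console."""
    # sys.stdin.encoding can be None (for example when input is piped),
    # so fall back to the locale's preferred encoding.
    return getattr(sys.stdin, "encoding", None) or locale.getpreferredencoding()


def to_text(raw, encoding=None):
    """Decode console bytes; pass through values that are already text."""
    if isinstance(raw, bytes):
        return raw.decode(encoding or console_encoding())
    return raw


# Simulated console input: Arabic text encoded as cp1256 (an assumption).
sample = u"\u0627\u0644\u0645\u062f\u0631".encode("cp1256")
decoded = to_text(sample, "cp1256")
```

In the answer's snippet this would replace the fixed `.decode("1256")` with something like `to_text(raw_input("test: "))`.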

edited Nov 30 '17 at 16:59

answered Nov 30 '17 at 08:28

user1767754


I ran your code and I got this: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc7 in position 0: invalid continuation byte – Hamid Shafie Asl Nov 30 '17 at 08:58
That's weird, what terminal are you using? Windows? Linux? – user1767754 Nov 30 '17 at 09:05
I use Windows 7.1 – Hamid Shafie Asl Nov 30 '17 at 09:07
Hmm... that might be a Windows thing. I don't have a Windows machine to test on right now, maybe in 8 hours. – user1767754 Nov 30 '17 at 09:19
This post has some discussion of Windows and Farsi issues, but since the error arises at the decoding level, and not during a print statement, I assumed that might not be it. https://stackoverflow.com/questions/39528462/python-3-print-function-with-farsi-arabic-characters – Arne Nov 30 '17 at 09:34
Thank you for your answer! 1256 doesn't work for some letters, and I still see "?". In my command line, chcp reports 720. I put 720 in my code, but it gives me an error. In my command line I can't type Persian, and it shows unknown symbols. – Hamid Shafie Asl Dec 02 '17 at 11:37
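
One common way for "?" to appear is a lossy conversion: a character that the target code page cannot represent is replaced with "?". Whether that is what happened here cannot be confirmed from the thread. The sketch below only illustrates the mechanism, using a CJK character that cp1256 certainly lacks as a stand-in for any unsupported letter:

```python
# -*- coding: utf-8 -*-
# An Arabic letter followed by a CJK character that cp1256 cannot represent.
text = u"\u062d\u4e2d"

# With errors="replace", each unencodable character becomes "?" instead of
# raising UnicodeEncodeError.
lossy = text.encode("cp1256", "replace")

# Decoding again cannot bring the lost character back.
roundtrip = lossy.decode("cp1256")
```

Once a letter has been replaced this way the information is gone, so the fix has to happen before the conversion, by using an encoding that covers every character in the input.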
