Print Arabic words and lists in Python 2.7

Question

I am using Anaconda Python 2.7 for Arabic text classification. When I print a word or a list of words, the output shows Unicode escape sequences instead of the actual Arabic words, which is what I want to see. The list contains [Arabic sentence, label].

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('mypath\\', r'(\w+)\.txt', cat_pattern=r'(\w+)\.txt', encoding='utf-8')
document = reader.words('fileid')

document[0]

output

[[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646', ...], 'Politic']

Possible duplicate of [Printing a string prints 'u' before the string in Python?](https://stackoverflow.com/questions/19170808/printing-a-string-prints-u-before-the-string-in-python) — Josh Lee, May 14 '18 at 20:52
Could it be that your console doesn't support unicode? What does `print u'\u0631'` return? Is it `ر` or is it `u'\u0631'`? — Jakob Lovern, May 14 '18 at 21:05
Ah, I see the issue. `print [u'\u0631\u0626\u064a\u0633']` yields the unicode control codes. Interestingly, it seems to output the Arabic characters when run under python 3.6. — Jakob Lovern, May 14 '18 at 21:11

score 0 · Answer 1 · answered May 14 '18 at 21:19

Off the top of my head, I'd assume this is because Python 2.7 was designed with an ASCII focus. As such, str(u'\u0631') raises a UnicodeEncodeError, because the ر character doesn't exist in ASCII. print u'\u0631' probably works because it simply sends the Unicode string straight to the console, which is equipped to handle Unicode rendering.
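To make the ASCII point concrete, here is a minimal sketch that runs on both Python 2.7 and Python 3. In Python 2, str() on a unicode object implicitly encodes it with the default ASCII codec; the explicit .encode('ascii') call below reproduces that step. The names ch, ascii_ok and utf8_bytes are illustrative only and do not come from the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

ch = u'\u0631'  # ARABIC LETTER REH

# ASCII covers only code points 0-127, so this character cannot be encoded.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent it; this character takes two bytes.
utf8_bytes = ch.encode('utf-8')

print(ascii_ok)         # False
print(len(utf8_bytes))  # 2
```

Whether print then shows ر correctly depends on the encoding the console reports to Python, which is why the same code can behave differently from one terminal to another.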

score 0 · Answer 2 · answered May 14 '18 at 23:06

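That's the way Python 2 works when you print lists: printing a list displays the repr() of each element, and in Python 2 the repr() of a unicode string escapes non-ASCII characters. Print the individual strings or update to Python 3: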

Python 2

>>> s = [[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
>>> print s
[[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
>>> print s[0][0]
رئيس
>>> print s[0][1]
البرلمان

Python 3

>>> s = [[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
>>> print(s)
[['رئيس', 'البرلمان'], 'Politic']
>>> print(s[0][0])
رئيس
>>> print(s[0][1])
البرلمان

You get the old behavior with ascii() in Python 3:

>>> print(ascii(s))
[['\u0631\u0626\u064a\u0633', '\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
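If staying on Python 2.7 is a requirement, one possible workaround is to build the output string yourself rather than printing the list object. The sketch below assumes the [words, label] structure shown in the question and runs on both Python 2.7 and Python 3. It shows two options: joining the words into one sentence, and serializing the whole structure with json.dumps(..., ensure_ascii=False), which leaves non-ASCII characters unescaped. The variable names are illustrative, and the JSON form uses double quotes, so it is not identical to Python's list syntax.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import json

# Same data as in the question: [list of words, label].
s = [['\u0631\u0626\u064a\u0633', '\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']

words, label = s

# Option 1: join the words into one text string and print that.
sentence = ' '.join(words)

# Option 2: serialize the whole nested list without \uXXXX escapes.
as_json = json.dumps(s, ensure_ascii=False)

print(sentence)
print(as_json)
```

On a console that can display Arabic, the first line shows the two words separated by a space and the second shows the full nested list with readable text.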
