I am using Anaconda Python 2.7 for Arabic text classification. When I print a word or a list of words, it appears as Unicode escape sequences, but I want to print the actual Arabic words. The list contains [Arabic sentence, label].

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('mypath\\', r'(\w+)\.txt', cat_pattern=r'(\w+)\.txt',encoding='utf-8')
document=reader.words('fileid')

document[0]

Output:

[[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646', ...], 'Politic']

Ahmed

2 Answers

Off the top of my head, I'd assume this is because Python 2.7 defaults to ASCII: str(u'\u0631') raises a UnicodeEncodeError because the ر character doesn't exist in ASCII. print u'\u0631' probably works because print encodes the Unicode string with the console's own encoding, and the console is equipped to render it.
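
To make the ASCII point concrete, here is a minimal sketch (the names ch, ascii_ok and utf8_bytes are only illustrative). It assumes nothing beyond the built-in encode() method and should behave the same way on Python 2.7 and Python 3:

```python
ch = u'\u0631'  # ARABIC LETTER REH, the character from the question

# Encoding to ASCII fails because the character is outside the ASCII range.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields the two bytes 0xD8 0xB1.
utf8_bytes = ch.encode('utf-8')
```

Here ascii_ok ends up False, which is the same failure that str() hits in Python 2 when it implicitly encodes with the ASCII codec.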

Jakob Lovern

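That's the way Python 2 works when you print lists: the list is displayed using the repr() of each element, and in Python 2 the repr() of a Unicode string escapes non-ASCII characters. Print the individual strings or update to Python 3: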

Python 2

>>> s = [[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
>>> print s
[[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
>>> print s[0][0]
رئيس
>>> print s[0][1]
البرلمان
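
If you need the whole nested list in readable form while staying on Python 2, one possible approach is to build the text yourself instead of relying on the list's repr(). The sketch below is only an illustration: render and text_type are hypothetical helper names, not part of NLTK or the standard library, and byte strings such as 'Politic' are assumed to be UTF-8.

```python
try:
    text_type = unicode  # Python 2: the unicode type
except NameError:
    text_type = str      # Python 3: str is already Unicode


def render(obj):
    """Return readable text for a (possibly nested) list of strings."""
    if isinstance(obj, list):
        # Recurse into nested lists and join the rendered items.
        return u'[' + u', '.join(render(item) for item in obj) + u']'
    if isinstance(obj, bytes):
        # Python 2 byte strings such as 'Politic'; UTF-8 is an assumption.
        return u"'" + obj.decode('utf-8') + u"'"
    if isinstance(obj, text_type):
        # Keep the characters as they are instead of escaping them.
        return u"'" + obj + u"'"
    return text_type(obj)


s = [[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
readable = render(s)
```

Printing readable (print readable in Python 2, print(readable) in Python 3) should then show the Arabic words rather than the \u escapes, provided the console encoding can represent them.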

Python 3

>>> s = [[u'\u0631\u0626\u064a\u0633', u'\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
>>> print(s)
[['رئيس', 'البرلمان'], 'Politic']
>>> print(s[0][0])
رئيس
>>> print(s[0][1])
البرلمان

You get the old behavior with ascii() in Python 3:

>>> print(ascii(s))
[['\u0631\u0626\u064a\u0633', '\u0627\u0644\u0628\u0631\u0644\u0645\u0627\u0646'], 'Politic']
Mark Tolonen