Python storing Arabic in an Array?

Question

I'm using Python 2.7. I have a list, ArbSyn, that contains Arabic strings, but they are stored as Unicode escapes. I want to convert them to normal Arabic letters and store them in the list ArbSynFinal. When I print encoded, it is printed in Arabic letters, but when I store it in ArbSynFinal using ArbSynFinal.append() and print the list, it is shown as Unicode escapes again. How can I solve this problem?

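import unicodedata

print("----ArbSyn----")
print ArbSyn
ArbSynFinal=[]
for bca in ArbSyn: #Converting from unicode to arabic done
    encoded=bca.encode('utf-8')#this works fine (but is overwritten by the next line)
    # Drop combining marks (diacritics); the result is still a unicode string
    encoded= u"".join([c for c in bca if not unicodedata.combining(c)])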
    print encoded
    ArbSynFinal.append(encoded)
print("------Arb Syn Final----------")
print ArbSynFinal

This is the output:

----ArbSyn----
[u'\u0627\u0642\u062a\u0631\u062d', u'\u0627\u062d\u062f\u0627\u062b', u'\u0645\u0648\u0633\u0633', u'\u0631\u0627\u062f', u'\u062a\u0633\u064a\u0633', u'\u0627\u062d\u062f\u0627\u062b',]
اقترح
احداث
موسس
راد
تسيس
احداث

------Arb Syn Final----------
[u'\u0627\u0642\u062a\u0631\u062d', u'\u0627\u062d\u062f\u0627\u062b', u'\u0645\u0648\u0633\u0633', u'\u0631\u0627\u062f', u'\u062a\u0633\u064a\u0633', u'\u0627\u062d\u062f\u0627\u062b']

What operating system and console are you using to print stdout? It matters. See [this](http://stackoverflow.com/a/3259271/5221082) for Windows — Bob Dylan, Feb 17 '16 at 18:53

score 3 · Accepted Answer · edited Feb 19 '16 at 16:42

Printing a list uses the repr() of its items, which on Python 2 always shows Unicode escapes for non-ASCII characters. Either switch to Python 3, where lists display printable Unicode characters, or build your own representation of the list. Always print Unicode strings directly to the terminal, without encoding them first. If the terminal supports the characters, they will display properly, regardless of whether the terminal uses UTF-8 or a legacy Arabic encoding like Windows-1256:

#!python2
ArbSyn = [u'\u0627\u0642\u062a\u0631\u062d', u'\u0627\u062d\u062f\u0627\u062b', u'\u0645\u0648\u0633\u0633', u'\u0631\u0627\u062f', u'\u062a\u0633\u064a\u0633', u'\u0627\u062d\u062f\u0627\u062b']

# Demonstrate the difference printing an item vs. its representation
for item in ArbSyn:
    print item,repr(item)

# Build a Unicode string representation of a list
as_list = u"['" + u"', '".join(ArbSyn) + u"']"
print as_list

Output:

اقترح u'\u0627\u0642\u062a\u0631\u062d'
احداث u'\u0627\u062d\u062f\u0627\u062b'
موسس u'\u0645\u0648\u0633\u0633'
راد u'\u0631\u0627\u062f'
تسيس u'\u062a\u0633\u064a\u0633'
احداث u'\u0627\u062d\u062f\u0627\u062b'
['اقترح', 'احداث', 'موسس', 'راد', 'تسيس', 'احداث']

Python 3:

#!python3
ArbSyn = ['\u0627\u0642\u062a\u0631\u062d', '\u0627\u062d\u062f\u0627\u062b', '\u0645\u0648\u0633\u0633', '\u0631\u0627\u062f', '\u062a\u0633\u064a\u0633', '\u0627\u062d\u062f\u0627\u062b']
print(ArbSyn)

Output:

['اقترح', 'احداث', 'موسس', 'راد', 'تسيس', 'احداث']

If you declare the encoding of your source file, you can also enter the Arabic characters directly in the source. Printing a list on Python 2 still uses repr(), so you still have to build a Unicode string for the list if you want it displayed properly.

#!python2
#coding:utf8
ArbSyn = [u'اقترح', u'احداث', u'موسس', u'راد', u'تسيس', u'احداث']
print ArbSyn
print u"['" + u"', '".join(ArbSyn) + u"']"

Output:

[u'\u0627\u0642\u062a\u0631\u062d', u'\u0627\u062d\u062f\u0627\u062b', u'\u0645\u0648\u0633\u0633', u'\u0631\u0627\u062f', u'\u062a\u0633\u064a\u0633', u'\u0627\u062d\u062f\u0627\u062b']
['اقترح', 'احداث', 'موسس', 'راد', 'تسيس', 'احداث']
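The string-building step above can be wrapped in a small helper. This is only a sketch: `format_unicode_list` is a hypothetical name that does not appear in the answer, and the code assumes every item is a Unicode string. The `__future__` imports let the same file run on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def format_unicode_list(items):
    """Return a list-like display string that shows characters, not escapes.

    Hypothetical helper for illustration. Quotes inside an item are not
    escaped, so the result is for display only, not for parsing back.
    """
    return "[" + ", ".join("'" + item + "'" for item in items) + "]"


ArbSyn = ['\u0627\u0642\u062a\u0631\u062d', '\u0631\u0627\u062f']

# Build the display string once, then print it directly as Unicode.
as_list = format_unicode_list(ArbSyn)
print(as_list)

# One item per line, as suggested in the comments on this answer.
print("\n".join(ArbSyn))
```

Quoting each item separately means an empty list comes out as `[]`, whereas the inline expression in the answer would produce `['']` for an empty list.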

That's amazing. At the beginning I was working with Python 3 and everything was perfect, but I had to use Arabic WordNet, which is only available on Python 2.7, so I'm stuck with 2.7. Thank you, that really helped! — IS92, Feb 18 '16 at 17:13
@I.Abdelsalam: you can print Unicode on Python 2 too. Just avoid `repr()` (explicit or implicit) e.g., to print a list of Unicode strings on Python 2 so that the characters are shown instead of Unicode escapes one per line: `print "\n".join(your_list)` — jfs, Feb 19 '16 at 16:40

sabbahillel · Answer 2 · 2016-02-18T11:00:48.080

Note that this is Python 2.7

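This is because ArbSynFinal is printed using the default output encoding. As a result, you need to encode each string before printing it (as you found in the question). A list has no encode() method, so this has to be done per element:

for word in ArbSynFinal:
    print word.encode('utf-8')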

However, if you want to avoid doing this every time, you can define a function myprint(text) and call it whenever you want to print a string.

def myprint(text):
    print text.encode('utf-8')

myprint(output)

The question "Python: How is sys.stdout.encoding chosen?" has an example of resetting the default output encoding.

import sys
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

This appears to work properly for a basic test. However, I do not have access to that site.
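To see what such a wrapper does, here is a minimal sketch that wraps an in-memory byte buffer instead of `sys.stdout`. The `io.BytesIO` stand-in and the variable names are assumptions made for illustration, so the encoded bytes can be inspected afterwards.

```python
from __future__ import print_function

import codecs
import io

# Stand-in for a byte-oriented stdout (assumption: used so the bytes can be inspected).
buffer = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; calling it wraps the stream.
writer = codecs.getwriter('utf-8')(buffer)

text = u'\u0631\u0627\u062f'
writer.write(text + u'\n')  # Unicode in, UTF-8 bytes out

raw = buffer.getvalue()
# 3 characters become 7 bytes: two bytes per Arabic letter plus the newline.
print(len(text), len(raw))
```

The wrapper always emits UTF-8, whatever the terminal expects, which is the caveat raised in the comments about hard-coding an output encoding.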

Another possibility is to set the environment variable PYTHONIOENCODING to "utf_8" before starting Python. This changes sys.stdout.encoding, which you can check with:

import sys
print sys.stdout.encoding
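The variable is read when the interpreter starts, so it has to be set before Python launches. A hedged way to check its effect is to start a child interpreter with the variable set and ask it for `sys.stdout.encoding`. The sketch below does this with `subprocess`; the names `env`, `child_code` and `reported` are illustrative, and the exact spelling of the reported name may differ between Python versions, so it is normalized with `codecs.lookup`.

```python
from __future__ import print_function

import codecs
import os
import subprocess
import sys

# Copy the current environment and add the variable for the child only.
env = dict(os.environ, PYTHONIOENCODING="utf_8")

# The child writes the encoding that its own stdout was configured with.
child_code = "import sys; sys.stdout.write(str(sys.stdout.encoding))"
reported = subprocess.check_output(
    [sys.executable, "-c", child_code], env=env
).decode("ascii").strip()

# Normalize spelling variants such as utf_8, UTF-8 and utf8 to one codec name.
normalized = codecs.lookup(reported).name
print(reported, normalized)
```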

I also found the following, but I do not know whether it works; I was unable to open the reference that reportedly shows it does not.

import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')

Thanks to @MarkTolonen for pointing out that setdefaultencoding breaks code and will not work.

No, printing lists always uses the `repr()`. If you want to see the characters in a list, you have to print the elements of a list or build your own Unicode string out of the list and print it. The reload trick is never recommended: [setdefaultencoding breaks code](https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/). Hard-coding output to a particular encoding breaks on terminals that don't default to that encoding. — Mark Tolonen, Feb 18 '16 at 02:59
@MarkTolonen Thanks. as I said I found the reference but was unable to test if it works. — sabbahillel, Feb 18 '16 at 10:54
