0

I am having some problems with accents.

I wrote a Python script that gets the word "refeição" from some input (an IMAP fetch). The word is Portuguese and I need to convert it so that it is human readable. After decoding, it should appear as "refeição", but I am not getting this result:

>>> print a 
refeição
>>> ENCODING = locale.getpreferredencoding()
>>> print ENCODING
UTF-8
>>> print a.encode(ENCODING)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
>>> a.decode('utf-8')
u'refei\xe7\xe3o'
>>> print a.decode('utf-8')
refeição

Updated:

root@ticuna:/etc/scripts# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Also, these words are inserted into a MySQL database, and the "unreadable" characters show up there in the same way as in the terminal. The table collation is utf8_general_ci.

Thomas

2 Answers

2

It looks like your terminal window displays text in the single-byte ISO-8859-1 charset ("latin-1"), but your Python interpreter thinks the terminal is speaking UTF-8. We can see from u'refei\xe7\xe3o' that Python has the correct internal representation of the Portuguese letters. Apparently, the print command then converts the internal representation to UTF-8 and sends it to your terminal, which produces gibberish when the terminal interprets that UTF-8 as ISO-8859-1.
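The mismatch described above can be reproduced without a terminal at all. The following is only a minimal sketch: it encodes the correct text as UTF-8 and then decodes those bytes as Latin-1, which is what a Latin-1 terminal effectively does. The variable names are illustrative, and the code is written to run on both Python 2 and Python 3.

```python
# Sketch only: reproduces the symptom without needing a misconfigured terminal.

text = u'refei\xe7\xe3o'             # the correct text, as in the decoded output above
utf8_bytes = text.encode('utf-8')    # what Python sends when the locale says UTF-8

# A Latin-1 terminal shows every byte as one character, so each accented
# letter (two bytes in UTF-8) turns into two unrelated characters.
misread = utf8_bytes.decode('latin-1')

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode('latin-1').decode('utf-8')

print(repr(utf8_bytes))
print(len(text), len(misread))       # 8 characters become 10
print(repaired == text)              # True
```

The two extra characters are the "accented uppercase vowel followed by strange punctuation" pattern: byte 0xc3 is displayed as Ã, and the byte after it as a symbol.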

The fix is to make your locale match what your terminal is doing -- either by changing the locale or by making sure your terminal is utf-8.
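To see what Python itself assumes, something like the sketch below may help. It reports only the encodings Python derives from the locale and from its output stream; it cannot tell you which charset the terminal emulator really renders, which has to be checked in the emulator's own settings. The helper name report_encodings is made up for this example.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python assumes (hypothetical helper)."""
    return {
        # Derived from the locale environment (LANG, LC_CTYPE, ...).
        'preferred': locale.getpreferredencoding(),
        # Encoding used by print; may be None when output is piped.
        'stdout': getattr(sys.stdout, 'encoding', None),
    }


for name, value in sorted(report_encodings().items()):
    print('%s: %s' % (name, value))
```

If both values say UTF-8 while accented text still comes out garbled, the terminal is the side that disagrees.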

hmakholm left over Monica
  • Hello Henning, my terminal is configured to use utf-8: root@ticuna:/etc/scripts# locale LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL= – Thomas Aug 06 '11 at 17:24
  • @Thomas: That does not show how your terminal is configured, only how applications running in that shell will behave. – Ignacio Vazquez-Abrams Aug 06 '11 at 17:38
  • @Ignacio, so what is the clue? As I updated my question, I am having the same problem when this text is inserted in mysql. – Thomas Aug 06 '11 at 17:45
  • The classical clue for "utf-8 bytes being interpreted as latin-1", when the underlying text uses Latin script, is an accented uppercase vowel followed by strange punctuation. Your output matches that perfectly. So Python and the database are both behaving perfectly according to the locale you have set, but you have set the locale wrong for the terminal emulator you're using. The function of the locale is to _inform_ programs how the terminal behaves -- it doesn't actually _control_ the terminal's behavior. – hmakholm left over Monica Aug 06 '11 at 18:12
  • So, what is your advice for showing accented vowels in the terminal and the database? I tried changing the table collation to latin1 and I still have the same problem. How can I change the terminal's behaviour so that it understands "latin-1"? – Thomas Aug 08 '11 at 13:11
  • Your problem is that the terminal _already_ displays latin-1, while you're telling your programs to send it UTF-8 data. The table collation in the database is (or should be, in any sane world) irrelevant. Just change your locale to ISO-8859-1 instead of UTF-8 already, and all should be fine. – hmakholm left over Monica Aug 08 '11 at 13:17
0

As a workaround, I am removing all accents.

Here is the code that I used:

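import unicodedata
def remove_accents(s):
   # s is a UTF-8 byte string: decode it, decompose with NFD, then drop combining marks ('Mn').
   return ''.join(c for c in unicodedata.normalize('NFD', s.decode('utf-8')) if unicodedata.category(c) != 'Mn')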

Based on this answer: What is the best way to remove accents in a Python unicode string?
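For reference, here is a hedged variant of the same idea that accepts either a byte string or already-decoded text, so it should behave the same on Python 2 and Python 3. The name strip_accents and its encoding parameter are my own choices for this sketch, not part of the linked answer.

```python
import unicodedata


def strip_accents(value, encoding='utf-8'):
    """Hypothetical variant of remove_accents that accepts bytes or text."""
    if isinstance(value, bytes):
        # Raw input (e.g. straight from the IMAP fetch) is decoded first.
        value = value.decode(encoding)
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', value)
    # Category 'Mn' (nonspacing mark) covers the cedilla and the tilde.
    return u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


print(strip_accents(u'refei\xe7\xe3o'))          # refeicao
print(strip_accents(b'refei\xc3\xa7\xc3\xa3o'))  # refeicao
```

Note that this discards information: the stored word is no longer correct Portuguese, so it only hides the display problem.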

Community
Thomas