I have a UTF-8 file with Spanish text, and some words with accent marks are displayed incorrectly in some software.

I believe my file is correct. For example, the name 'JESÚS' is encoded as 4A 45 53 C3 9A 53.

>>> b'\x4A\x45\x53\xC3\x9A\x53'.decode('utf-8')
'JESÚS'

c39a is the correct UTF-8 encoding for \u00da, according to http://www.fileformat.info/info/unicode/char/00da/index.htm.

So why does some software render it incorrectly?

Martijn Pieters
alexanderlukanin13

3 Answers

This is the result of decoding the file as Latin-1 instead of UTF-8. Each two-byte UTF-8 sequence is incorrectly decoded as two separate characters.

>>> 'Ú'.encode('utf-8').decode('latin-1')
'Ã\x9a'
>>> 'É'.encode('utf-8').decode('latin-1')
'Ã\x89'

http://www.fileformat.info/info/unicode/char/9a/index.htm http://www.fileformat.info/info/unicode/char/89/index.htm

Both of these characters are control characters, so they may or may not be displayed in different software.

Moreover, repeating incorrect encoding-decoding corrupts the text even further:

>>> 'Ú'.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
'Ã\x83Â\x9a'

UPDATE: If you are seeing actual š and ‰ (and not invisible control characters), the wrong encoding is Windows-1252.

Windows-1252 is a superset of ISO 8859-1, with printable characters for 0x80-0x9f.

In Windows-1252, code points 0x9a and 0x89 correspond to the characters š and ‰: http://www.fileformat.info/info/unicode/char/0161/index.htm http://www.fileformat.info/info/unicode/char/2030/index.htm

>>> 'Ú'.encode('utf-8').decode('Windows-1252')
'Ãš'
>>> 'É'.encode('utf-8').decode('Windows-1252')
'Ã‰'
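
If the garbled text has already ended up in a string, the mistaken decode can often be reversed by encoding it back with the wrong codec and decoding the result as UTF-8. A minimal Python 3 sketch, where `repair_mojibake` is a hypothetical helper name (not a library function); it assumes the text was garbled exactly once per call and that no bytes were lost along the way:

```python
def repair_mojibake(garbled, wrong_codec='cp1252'):
    """Undo one round of UTF-8 bytes decoded with the wrong codec.

    Hypothetical helper. Raises UnicodeError if the text does not
    round-trip, for example when it was never garbled.
    """
    return garbled.encode(wrong_codec).decode('utf-8')


garbled = 'JESÚS'.encode('utf-8').decode('cp1252')
assert garbled == 'JESÃšS'
assert repair_mojibake(garbled) == 'JESÚS'

# Text garbled twice through Latin-1 needs two repair passes.
twice = 'Ú'.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
once = repair_mojibake(twice, 'latin-1')
assert repair_mojibake(once, 'latin-1') == 'Ú'
```

Fixing the decoding step at the source is preferable where possible; a repair like this only works while every byte of the original survives.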
alexanderlukanin13

You are opening your file in software that decodes the data using a different codec. My guess is that it is being opened with the Windows-1252 codepage. The result is mojibake, that is, garbled text.

The UTF-8 codec encodes Unicode codepoints to a variable number of bytes, depending on the character encoded. The first 128 characters of the Unicode standard (corresponding to the ASCII standard) require just one byte. The next 1920 codepoints (U+0080 through U+07FF, which include the accented Latin-1 characters) are encoded to two bytes, and so on up to 4 bytes (the original UTF-8 design allowed up to 6 bytes per codepoint, but RFC 3629 limits it to 4).
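
The variable width is easy to check in Python 3. This is only an illustrative sketch; the sample characters are chosen here to hit each byte length and are not taken from the question:

```python
# One sample character for each UTF-8 encoded length (1 to 4 bytes).
samples = ['A', 'Ú', '€', '😀']

for ch in samples:
    encoded = ch.encode('utf-8')
    # Print only ASCII: the codepoint, the byte count and the bytes in hex.
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}')

lengths = [len(ch.encode('utf-8')) for ch in samples]
assert lengths == [1, 2, 3, 4]
```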

Your text contains 2 Latin-1 characters, thus requiring 2 bytes each:

>>> u'Ú and É'.encode('utf8')
'\xc3\x9a and \xc3\x89'

Note how the spaces and the word 'and' are encoded to single bytes (Python displays those as their ASCII characters for us because that's more readable than \x.. escape sequences).

Some of your software is decoding that data using a different codec. The CP1252 codec decodes each byte as a single character, so C3 is decoded to Ã, while 9A maps to š and 89 to ‰:

>>> u'Ú and É'.encode('utf8').decode('cp1252')
u'\xc3\u0161 and \xc3\u2030'
>>> print u'Ú and É'.encode('utf8').decode('cp1252')
Ãš and Ã‰

Note that the ASCII characters in that sample (the spaces and the word 'and') are not affected, because UTF-8 and CP1252 use exactly the same bytes for these; both follow ASCII for the first 128 byte values.
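
A quick way to confirm this in Python 3 (a sketch, not part of the original session above, which uses Python 2):

```python
# Byte values 0x00 to 0x7F decode identically under ASCII, UTF-8 and CP1252.
ascii_bytes = bytes(range(128))
assert (ascii_bytes.decode('ascii')
        == ascii_bytes.decode('utf-8')
        == ascii_bytes.decode('cp1252'))

# From 0x80 upward the two codecs disagree about what the bytes mean.
data = 'Ú'.encode('utf-8')  # b'\xc3\x9a'
assert data.decode('utf-8') == 'Ú'
assert data.decode('cp1252') == 'Ãš'
```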

Martijn Pieters

Automatically detecting text encodings is unreliable. For us humans the right encoding is often obvious after some practice, but whatever program you come up with can fail for some input; for example, texts that discuss broken character encodings, like this page(!).

Hence, many programs working with texts simply do not do autodetection, but rely on the users specifying the encoding.
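
To illustrate why guessing is fragile, here is a sketch of a common heuristic in Python 3: try strict UTF-8 first and fall back to CP1252. The function name `decode_guess` and the choice of CP1252 as the fallback are assumptions made for this example only:

```python
def decode_guess(data):
    """Hypothetical heuristic: strict UTF-8 first, then CP1252."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Byte values that CP1252 leaves undefined become U+FFFD.
        return data.decode('cp1252', errors='replace')


assert decode_guess(b'JES\xc3\x9aS') == 'JESÚS'  # valid UTF-8
assert decode_guess(b'JES\xdaS') == 'JESÚS'      # invalid UTF-8, CP1252 fallback

# The guess goes wrong when CP1252 bytes happen to form valid UTF-8,
# as in a CP1252 file that quotes mojibake on purpose.
cp1252_bytes = 'Ãš'.encode('cp1252')
assert decode_guess(cp1252_bytes) == 'Ú'  # the writer meant 'Ãš'
```

The last case is exactly the kind of input that a page about broken encodings would contain.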

With Unicode, there is the BOM (Byte Order Mark) that can assist you. In UTF-8, if you start your text with the bytes 0xEF 0xBB 0xBF, it can help some programs confirm the encoding of the whole text.
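
In Python 3, one way to write and read that marker is the standard `utf-8-sig` codec. A short sketch using the sample word from the question:

```python
import codecs

text = 'JESÚS'

# 'utf-8-sig' prepends the three BOM bytes when encoding.
with_bom = text.encode('utf-8-sig')
assert with_bom.startswith(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'
assert with_bom[3:] == text.encode('utf-8')

# When decoding, 'utf-8-sig' strips a leading BOM if one is present,
# while plain 'utf-8' keeps it as the character U+FEFF.
assert with_bom.decode('utf-8-sig') == text
assert with_bom.decode('utf-8') == '\ufeff' + text

# Data without a BOM also decodes fine with 'utf-8-sig'.
assert text.encode('utf-8').decode('utf-8-sig') == text
```

Whether a given program honours the BOM is up to that program, so this helps only with software that looks for it.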

Another large class of programs interprets HTML text; for those you can use the meta tags, as shown in the question discussing the options:

<meta charset="utf-8"> vs <meta http-equiv="Content-Type">

For all other programs, it's up to them. Do you have any examples that you would like to make work?

chexum