You are opening your file in software that decodes the data using a different codec. My guess is that it is decoding the file using the Windows codepage 1252. The result is Mojibake: garbled text.
The UTF-8 codec encodes Unicode codepoints to a variable number of bytes, depending on the character encoded. The first 128 codepoints of the Unicode standard (matching the ASCII standard) require just one byte; the next 1920 codepoints (which include the Latin-1 supplement) are encoded to two bytes, and so on, up to 4 bytes per codepoint (the original UTF-8 design allowed for up to 6 bytes, but RFC 3629 limits UTF-8 to 4).
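To illustrate the variable width, here is a quick sketch in the same Python 2 REPL style as the examples below (the sample characters are my own picks, not from your file):

>>> len(u'A'.encode('utf8'))           # ASCII: 1 byte
1
>>> len(u'É'.encode('utf8'))           # Latin-1 supplement: 2 bytes
2
>>> len(u'€'.encode('utf8'))           # Euro sign, U+20AC: 3 bytes
3
>>> len(u'\U0001F600'.encode('utf8'))  # emoji outside the BMP: 4 bytes
4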
Your text contains 2 Latin-1 characters, thus requiring 2 bytes each:
>>> u'Ú and É'.encode('utf8')
'\xc3\x9a and \xc3\x89'
Note how the spaces and the word and are encoded to single bytes each (Python displays those bytes as their ASCII characters for us because that's more readable than \x.. escape sequences).
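If you want to see every byte explicitly, here is a sketch (in Python 2, iterating over a byte string yields one-byte strings, and ord gives the byte value):

>>> [hex(ord(b)) for b in u'Ú and É'.encode('utf8')]
['0xc3', '0x9a', '0x20', '0x61', '0x6e', '0x64', '0x20', '0xc3', '0x89']

The 0x20, 0x61, 0x6e and 0x64 bytes are the two spaces and the letters a, n and d.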
Some of your software is decoding that data using a different codec. The CP1252 codec would decode each byte as a single character, so C3 is decoded to Ã, while 9A maps to š and 89 to ‰:
>>> u'Ú and É'.encode('utf8').decode('cp1252')
u'\xc3\u0161 and \xc3\u2030'
>>> print u'Ú and É'.encode('utf8').decode('cp1252')
Ú and É
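You can see the same per-byte mapping in isolation (a sketch; each one-byte string decodes to exactly one CP1252 character):

>>> '\xc3'.decode('cp1252')
u'\xc3'
>>> print '\xc3'.decode('cp1252'), '\x9a'.decode('cp1252'), '\x89'.decode('cp1252')
Ã š ‰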
Note that the ASCII characters in that sample (the spaces and the word and) are not affected, because both UTF-8 and CP1252 use the exact same bytes for these; both encode the first 128 codepoints to the same single ASCII bytes.
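The proper fix is to configure the other software to decode the file as UTF-8. If you already have mangled text, reversing the two steps recovers the original; a sketch that only works when no bytes were lost or replaced along the way:

>>> garbled = u'Ú and É'.encode('utf8').decode('cp1252')
>>> print garbled.encode('cp1252').decode('utf8')
Ú and É

For more complicated cases of Mojibake, the ftfy library can automate this kind of repair.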