
I have a file that contains a unicode string: u"L'\xe9quipe le quotidien"

I have another file, exported from Windows and encoded as iso-8859-1 with the same string: "L'<E9>quipe le quotidien" (this is a copy/paste from less in my shell).

Converting the content of the Windows file with decode('iso-8859-1').encode('utf8') results in a string that is different from the one in the Windows file: L'Ã©quipe le quotidien.

What is the best way to do this comparison? I seem to be unable to convert the latin1 string into utf-8.

Martijn Pieters
Rui Pacheco

1 Answer


Your file is not encoded as Latin-1 (ISO-8859-1). It contains a Mojibake instead: UTF-8 bytes that were mis-decoded as Latin-1. To repair it, encode the string back to Latin-1, then decode it as UTF-8:

>>> print u"L'Ã©quipe le quotidien.".encode('latin1').decode('utf8')
L'équipe le quotidien.
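Here is the full round trip spelled out, in Python 3 syntax (the question uses Python 2, where str and unicode are separate types, but the byte-level logic is the same):

```python
# A minimal sketch of how the mojibake arises and how to undo it.
# The file's real bytes are UTF-8:
raw = "L'équipe le quotidien".encode('utf-8')

# Mis-decoding those bytes as Latin-1 produces the mojibake:
mojibake = raw.decode('iso-8859-1')
assert mojibake == "L'Ã©quipe le quotidien"

# Re-encoding as Latin-1 recovers the original UTF-8 bytes,
# which then decode correctly:
repaired = mojibake.encode('iso-8859-1').decode('utf-8')
assert repaired == "L'équipe le quotidien"
```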

Generally speaking, you'd decode both files to unicode objects before comparing. Even then, you can still run into issues with Combining Diacritical Marks, where the letter é can also be represented with two codepoints, U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
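To see the problem (Python 3 syntax): both spellings render identically as é, yet they compare unequal because the codepoint sequences differ:

```python
composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Same glyph on screen, different codepoint sequences:
print(composed == decomposed)              # False
print(len(composed), len(decomposed))      # 1 2
```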

You can work around that up to a point by normalising the text: pick one form, composed or decomposed, and normalise both strings to it with the unicodedata.normalize() function. See Normalizing Unicode for more details.

Martijn Pieters