3

I have data in CSV format whose character encoding has been seriously scrambled, likely from going back and forth between different software applications (LibreOffice Calc, Microsoft Excel, Google Refine, custom PHP/MySQL software; on Windows XP, Windows 7 and GNU/Linux machines from various regions of the world...). It seems like somewhere in the process, non-ASCII characters became seriously scrambled, and I'm not sure how to descramble them or detect a pattern. Doing so manually would involve a few thousand records...

Here's an example. For "Trois-Rivières", when I open this portion of the CSV file in Python, it says:

Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res

Question: through what process can I reverse

\xc3\x83\xc2\x85\xc3\x82\xc2\xa0

to get back

è

i.e. how can I unscramble this? How might this have become scrambled in the first place? How can I reverse engineer this bug?

M--
balleyne
    It kind of looks like utf8 over iso over utf8... Maybe you should try with `iconv`? – fge Jan 02 '12 at 22:47
  • 2
    I doubt `\xc3\x83\xc2\x85\xc3\x82\xc2\xa0` converts to a single `è`. – Oded Jan 02 '12 at 22:47
  • `echo -e 'Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res' | file -`: /dev/stdin: UTF-8 Unicode text, with LF, NEL line terminators – sehe Jan 02 '12 at 23:04
  • 1
    It looks like it got corrupted beyond repair (likely due to _wrong_ charset conversions before this point). I can't see a saner interpretation than utf8, but it looks corrupted even then – sehe Jan 02 '12 at 23:06
  • I think you're right, corrupted beyond repair due to multiple wrong charset conversions in the past... thanks. – balleyne Jan 03 '12 at 01:13
  • 2
    It looks like UTF-8 after a *double* UTF-8 as ISO-8859-1 misinterpretation, but decoding it this way gives U+0160 (Š) instead of the expected U+00E8 (è). – dan04 Jan 03 '12 at 03:05
  • @dan04: ...and that explains what happened: the original text was in [CP850](https://en.wikipedia.org/wiki/Code_page_850) or some closely related codepage (where è = 0x8A), but it was misinterpreted as [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) (where 0x8A = Š), converted to UTF-8, and then misinterpreted as ISO-8859-1 and reconverted to UTF-8 twice. Simple. :) – Ilmari Karonen Dec 28 '18 at 20:46

2 Answers

3

You can check the solutions offered in: Double-decoding unicode in python

Another, simpler brute-force solution is to build a mapping table for the small set of scrambled character sequences, using a regular-expression search (e.g. `(\\x[a-f0-9]{2}){8}`) on your input file. For a file from a single source, you should have fewer than 32 distinct sequences for French and fewer than 10 for German. Then you can run a "find and replace" using this small mapping table.
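A minimal sketch of that mapping-table approach in Python. The one table entry below is the sequence from the question; any other entries are assumed to be collected by hand after inspecting the runs that `scrambled_runs` reports for your own file:

```python
import re

def scrambled_runs(data: bytes) -> set:
    """Collect the distinct runs of high (non-ASCII) bytes, so each one
    can be inspected and mapped to the character it should have been."""
    return set(re.findall(rb"(?:[\xc0-\xff][\x80-\xbf]+)+", data))

# Mapping table: scrambled byte sequence -> intended UTF-8 bytes.
# Only the entry from the question is known; the rest must be added by hand.
FIXES = {
    b"\xc3\x83\xc2\x85\xc3\x82\xc2\xa0": "è".encode("utf-8"),
}

def unscramble(data: bytes) -> bytes:
    for bad, good in FIXES.items():
        data = data.replace(bad, good)
    return data

raw = b"Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res"
print(unscramble(raw).decode("utf-8"))  # Trois-Rivières
```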

Guy
2

Based on dan04's comment above, we can guess that somehow the letter "è" was misinterpreted as an "Š", which then had three-fold UTF-8 encoding applied to it.

So how did "è" turn into "Š", then? Well, I had a hunch that the most likely explanation would be between two different 8-bit charsets, so I looked up some common character encodings on Wikipedia, and found a match: in CP850 (and in various other related 8-bit DOS code pages, such as CP851, CP853, CP857, etc.) the letter "è" is encoded as the byte 0x8A, which in Windows-1252 represents "Š" instead.

With this knowledge, we can recreate this tortuous chain of mis-encodings with a simple Unix shell command line:

$ echo "Trois-Rivières" \
  | iconv -t cp850 \
  | iconv -f windows-1252 -t utf-8 \
  | iconv -f iso-8859-1 -t utf-8 \
  | iconv -f iso-8859-1 -t utf-8 \
  | iconv -f ascii --byte-subst='\x%02X'

Trois-Rivi\xC3\x83\xC2\x85\xC3\x82\xC2\xA0res

Here, the first iconv call just converts the string from my local character encoding — which happens to be UTF-8 — to CP850, and the last one just encodes the non-ASCII bytes with Python-style \xNN escape codes. The three iconv calls in the middle recreate the actual re-encoding steps applied to the data: first from (assumed) Windows-1252 to UTF-8, and then twice from ISO-8859-1 to UTF-8.
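The same chain of mis-encodings can be reproduced in Python, whose codec names match iconv's. This is just a sketch; as with the pipeline above, only the middle three steps model the actual corruption:

```python
s = "Trois-Rivières"
garbled = (s.encode("cp850")           # è -> byte 0x8A
            .decode("windows-1252")    # 0x8A misread as Š (U+0160)
            .encode("utf-8")           # Š -> 0xC5 0xA0
            .decode("iso-8859-1")      # each UTF-8 byte misread as a Latin-1 char
            .encode("utf-8")
            .decode("iso-8859-1")      # ...and misread the same way again
            .encode("utf-8"))
print(garbled)  # b'Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res'
```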

So how can we fix it? Well, we just need to apply the same steps in reverse:

$ echo -e 'Trois-Rivi\xC3\x83\xC2\x85\xC3\x82\xC2\xA0res' \
  | iconv -f utf-8 -t iso-8859-1 \
  | iconv -f utf-8 -t iso-8859-1 \
  | iconv -f utf-8 -t windows-1252 \
  | iconv -f cp850

Trois-Rivières
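The reversal can be sketched in Python as well, with each decode/encode pair undoing one layer of the chain, mirroring the iconv pipeline:

```python
raw = b"Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res"
fixed = (raw.decode("utf-8").encode("iso-8859-1")    # undo outermost UTF-8 layer
            .decode("utf-8").encode("iso-8859-1")    # undo second UTF-8 layer
            .decode("utf-8").encode("windows-1252")  # Š back to byte 0x8A
            .decode("cp850"))                        # 0x8A -> è
print(fixed)  # Trois-Rivières
```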

The good news is that this process should be mostly reversible. The bad news is that any "ü", "ì", "Å", "É" and "Ø" letters in the original text may have been irreversibly mangled, since the bytes used to encode those letters in CP850 are undefined in Windows-1252. (If you're lucky, they may have been interpreted as the same C1 control codes that those bytes represent in ISO-8859-1, in which case back-conversion should in principle be possible. I haven't managed to figure out how to convince iconv to do it, though.)

Ilmari Karonen