How to recover a string with a broken charset to Unicode?

Question

I have a string (actually a file name) like: Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ. It is a leftover from a broken Lenovo NAS and Samba configuration.

enca reports: Universal transformation format 8 bits; UTF-8 Doubly-encoded to UTF-8 from ISO-8859-5

How can I recover the string (file name) using Perl, shell, or Python?

The string you gave us is already beyond repair. But you'd decode from UTF-8, encode to Latin-1, decode from UTF-8, encode to ISO-8859-5. Have you got the *original* string for us? – Martijn Pieters, Oct 01 '14 at 15:42
I can get to `но??ажна?.` which is *almost* there, but the question marks indicate missing / broken UTF-8 byte sequences. – Martijn Pieters, Oct 01 '14 at 15:45
I can read the file name from the filesystem, but I get: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data – Andrii Petrenko, Oct 01 '14 at 16:26

score 4 · Accepted Answer · answered Oct 01 '14 at 15:48

You'll have to reverse the process. In Python, you can encode Unicode values to Latin-1 to map each character back to exactly one byte, so the process would be:

Decode from UTF-8 to Unicode
Encode from Unicode to Latin-1
Decode from UTF-8 to Unicode again
Encode to ISO-8859-5
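
The steps above can be sketched in Python 3 roughly as follows. This is a minimal illustration, not the answer's own code: the function name `undo_double_utf8` and the sample word are assumptions, and the final encode to ISO-8859-5 is left out because a Python 3 `str` is usually what you want to end up with.

```python
def undo_double_utf8(data: bytes, errors: str = "strict") -> str:
    """Reverse text that was UTF-8 encoded, misread as Latin-1, then UTF-8 encoded again.

    Hypothetical helper for illustration; `errors` is passed to the inner decode
    so that damaged sequences can be ignored or replaced instead of raising.
    """
    # Steps 1 and 2: decode the outer UTF-8 layer, then map every character
    # back to the single byte it originally was (Latin-1 is a 1:1 byte mapping).
    inner = data.decode("utf-8").encode("latin-1")
    # Step 3: decode the inner UTF-8 layer to get the real text.
    return inner.decode("utf-8", errors)


# Simulate the damage on a sample word, then repair it.
original = "Хлеб"
mangled = original.encode("utf-8").decode("latin-1").encode("utf-8")
print(undo_double_utf8(mangled))
```

If bytes in ISO-8859-5 are really needed (step 4), calling `.encode("iso8859_5")` on the result should produce them.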

Your mangled text is missing characters that were not printable. If I ignore the broken characters, I get:

>>> 'Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ.'.decode('utf8').encode('latin1').decode('utf8', 'ignore').encode('iso8859_5')
'\xdd\xde\xd0\xd6\xdd\xd0.'

Printing the result before encoding to ISO-8859-5, but replacing broken characters with a placeholder:

>>> print 'Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ.'.decode('utf8').encode('latin1').decode('utf8', 'replace')
но��ажна�.

Martijn Pieters


Thank you. It works in 50% of cases; in the other 50% I get: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data – Andrii Petrenko Oct 01 '14 at 16:20
@AndriiPetrenko: in which case your string appears to be beyond repair. – Martijn Pieters Oct 01 '14 at 16:26
@AndriiPetrenko: can you paste the `print repr()` output of such failing files instead please? – Martijn Pieters Oct 01 '14 at 17:47
@AndriiPetrenko: If I ignore the decoding and encoding errors (`'ignore'` as second argument) I get the value `S01E01 (АлNминиеваN NолNга, СноNбоNдN, КонNакNнNе линзN, Хлеб).avi`. Can you give me any clue as to what we are now missing here? – Martijn Pieters Oct 02 '14 at 14:28
@AndriiPetrenko: the original filename as intended would be helpful, because that'll give us a clue as to what values might be missing in the original to cause this. – Martijn Pieters Oct 02 '14 at 14:28
@AndriiPetrenko: the original error is because the UTF-8 input is `N\xcc\x83\xc2\x8e` in places, where the `\xcc\x83` is a valid UTF-8 sequence but I suspect that the `\xc2\x8e` is instead the bytes we want to treat separately. – Martijn Pieters Oct 02 '14 at 14:30
@AndriiPetrenko: having taken a stab at it with various other encodings, I cannot help but feel there are bytes missing somewhere. – Martijn Pieters Oct 02 '14 at 14:59
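
For reference, the byte sequence quoted in the comments can be inspected directly. The Python 3 sketch below lists the code points in `N\xcc\x83\xc2\x8e`, then tries one possible repair that this thread does not confirm: composing `N` plus the combining tilde into `Ñ` with NFC normalization before the Latin-1 step. Both helper names are hypothetical.

```python
import unicodedata


def describe(data: bytes) -> list:
    """Return (code point, Unicode name) pairs for UTF-8 encoded bytes."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in data.decode("utf-8")
    ]


def undo_double_utf8_nfc(data: bytes) -> str:
    """Hypothetical variant: NFC-normalize first, then undo the double encoding."""
    # NFC composes 'N' + U+0303 (combining tilde) into the single character U+00D1.
    text = unicodedata.normalize("NFC", data.decode("utf-8"))
    # U+00D1 and U+008E map back to the bytes D1 8E, which is valid UTF-8.
    return text.encode("latin-1").decode("utf-8")


sample = b"N\xcc\x83\xc2\x8e"  # the sequence quoted in the comment above
print(describe(sample))
print(undo_double_utf8_nfc(sample))  # prints: ю
```

For this one sequence the result is `ю`, which would fit a file name that was Unicode-decomposed after being double-encoded. That is only an assumption about the cause; whether it holds for the actual file names is not established here.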

score 0 · Answer 2 · answered Jul 16 '17 at 07:33

I had a very similar problem, judging by enca -L ru broken-file.txt output:

Universal transformation format 8 bits; UTF-8
  Surrounded by/intermixed with non-text data
  Doubly-encoded to UTF-8 from ISO-8859-5

The answer above did not solve my problem, so I tried the following variation:

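def decode(contents):
    # contents: raw bytes as read from the file
    u = contents.decode("utf-8")
    # Code points below 256 map back to single bytes
    d = u.encode("raw_unicode_escape")
    # Interpret those bytes as Windows-1251
    return d.decode("cp1251")

# Can be used like this (the file must be opened in binary mode, "rb"):
decode(open('broken-file.txt', "rb").read())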

Please note that in my case enca reported the wrong encoding: I replaced ISO-8859-5 with Windows-1251 because the former is rarely used in practice. I also used raw_unicode_escape instead of latin-1; credit goes to the question "Decoding double encoded utf8 in Python".
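
To make the variation easier to try, here is a small Python 3 round trip. It is a sketch under the assumption that the damage came from Windows-1251 bytes being misread as Latin-1 and then stored as UTF-8; the sample word is made up for illustration. It also shows how `raw_unicode_escape` differs from `latin-1` for a character outside the Latin-1 range.

```python
def decode(contents: bytes) -> str:
    """Same logic as the answer's function, with type hints added."""
    u = contents.decode("utf-8")
    d = u.encode("raw_unicode_escape")
    return d.decode("cp1251")


# Simulate the assumed damage: cp1251 bytes misread as Latin-1, stored as UTF-8.
original = "привет"
broken = original.encode("cp1251").decode("latin-1").encode("utf-8")
print(decode(broken))

# Difference between the two codecs for a character above U+00FF:
numero = "\u2116"  # NUMERO SIGN, not representable in Latin-1
try:
    numero.encode("latin-1")
except UnicodeEncodeError:
    print("latin-1 cannot encode U+2116")
# raw_unicode_escape does not raise; it emits a literal backslash escape instead.
print(numero.encode("raw_unicode_escape"))
```

For code points below 256 the two codecs produce the same bytes, so the choice only matters when the mangled text contains characters that Latin-1 cannot represent.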

score 0 · Answer 3 · answered Jun 04 '21 at 21:31

I'm not sure this text is salvageable, but as a generic answer, there's a great Python package called ftfy that attempts to repair mis-decoded text and can explain the steps it took.

The basic CLI usage looks like this:

$ echo "Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ" | ftfy
ноÑÑажнаÑ
$ echo "Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ" | ftfy -e iso-8859-5
УТНУТОУ'У'УТАУТЖУТНУТАУ'

I've used it with other inputs successfully like this:

$ echo 'Juan CanÌƒas' | ftfy
Juan Cañas

With the Python API, you can get explanations and handle them:

>>> ftfy.fix_and_explain('Juan CanÌƒas')
ExplainedText(text='Juan Cañas', explanation=[('encode', 'sloppy-windows-1252'), ('decode', 'utf-8'), ('normalize', 'NFC')])
