1

I have a string (actually a file name) like: ноÑÑажнаÑ. It is a leftover from a broken Lenovo NAS and Samba configuration.

enca reports: Universal transformation format 8 bits; UTF-8 Doubly-encoded to UTF-8 from ISO-8859-5

How can I recover the string (file name) using Perl, shell, or Python?

Martijn Pieters
Andrii Petrenko
  • The string you gave us is already beyond repair. But you'd decode from UTF-8, encode to Latin-1, decode from UTF-8, encode to ISO-8859-5. Have you got the *original* string for us? – Martijn Pieters Oct 01 '14 at 15:42
  • I can get to `но??ажна?.` which is *almost* there, but the question marks indicate missing / broken UTF-8 byte sequences. – Martijn Pieters Oct 01 '14 at 15:45
  • I can read the file name from the filesystem, but I get: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data – Andrii Petrenko Oct 01 '14 at 16:26

3 Answers

4

You'll have to reverse the process. In Python, you can encode Unicode values to Latin-1 to get the bytes back one-to-one, so the process would be:

  • Decode from UTF-8 to Unicode
  • Encode from Unicode to Latin-1
  • Decode from UTF-8 to Unicode again
  • Encode to ISO-8859-5

Your mangled text is missing characters that were not printable. If I ignore the broken characters, I get:

>>> 'ноÑÑажнаÑ.'.decode('utf8').encode('latin1').decode('utf8', 'ignore').encode('iso8859_5')
'\xdd\xde\xd0\xd6\xdd\xd0.'

Printing the result before encoding to ISO-8859-5, but replacing broken characters with a placeholder:

>>> print 'ноÑÑажнаÑ.'.decode('utf8').encode('latin1').decode('utf8', 'replace')
но��ажна�.
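
For Python 3, where `str` is already Unicode, the same round trip can be sketched as follows. This is a minimal sketch and not part of the original answer: the helper name `undo_double_utf8` and the sample word are made up for illustration, and it stops at the Unicode string rather than re-encoding to ISO-8859-5. The truncated sample shows how a missing continuation byte leads to the "unexpected end of data" error reported in the comments.

```python
def undo_double_utf8(mangled: str, errors: str = "strict") -> str:
    """Reverse one layer of 'UTF-8 bytes read as Latin-1' mojibake.

    Hypothetical helper: `errors` is passed to the final UTF-8 decode,
    so "replace" or "ignore" can be used on damaged input.
    """
    raw = mangled.encode("latin-1")    # each char U+0000..U+00FF -> one byte
    return raw.decode("utf-8", errors)


# Build a mangled sample from a known word so the result can be checked.
original = "нота"
mangled = original.encode("utf-8").decode("latin-1")
print(undo_double_utf8(mangled))

# Drop the last character to simulate a lost (non-printable) byte.
damaged = mangled[:-1]
try:
    undo_double_utf8(damaged)
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# With errors="replace" the broken sequence becomes U+FFFD instead.
print(undo_double_utf8(damaged, errors="replace"))
```

With `errors="strict"` the damaged sample fails, which is why the original answer falls back to `'ignore'` and `'replace'`.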
Martijn Pieters
  • Thank you. It works in 50% of the cases; in the other 50% I get: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data – Andrii Petrenko Oct 01 '14 at 16:20
  • @AndriiPetrenko: in which case your string appears to be beyond repair. – Martijn Pieters Oct 01 '14 at 16:26
  • @AndriiPetrenko: can you paste the `print repr()` output of such failing files instead please? – Martijn Pieters Oct 01 '14 at 17:47
  • @AndriiPetrenko: If I ignore the decoding and encoding errors (`'ignore'` as second argument) I get the value `S01E01 (АлNминиеваN NолNга, СноNбоNдN, КонNакNнNе линзN, Хлеб).avi`. Can you give me any clue as to what we are now missing here? – Martijn Pieters Oct 02 '14 at 14:28
  • @AndriiPetrenko: the original filename as intended would be helpful, because that'll give us a clue as to what values might be missing in the original to cause this. – Martijn Pieters Oct 02 '14 at 14:28
  • @AndriiPetrenko: the original error is because the UTF-8 input is `N\xcc\x83\xc2\x8e` in places, where the `\xcc\x83\xc2` is a valid UTF-8 sequence but I suspect that the `\xc2\x8e` is instead the bytes we want to treat separately. – Martijn Pieters Oct 02 '14 at 14:30
  • @AndriiPetrenko: having taken a stab at it with various other encodings, I cannot help but feel there are bytes missing somewhere. – Martijn Pieters Oct 02 '14 at 14:59
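
One possible explanation for the `N\xcc\x83\xc2\x8e` sequence, offered here as an assumption that the thread does not confirm: `N` followed by `\xcc\x83` is `N` plus a combining tilde (U+0303), that is, a decomposed `Ñ`, as produced by file systems or tools that store names in NFD form. If that is what happened, normalizing to NFC before the Latin-1 step would restore the single `Ñ` (byte `\xd1`), which then pairs with `\x8e`. A minimal sketch, with a hypothetical helper name:

```python
import unicodedata


def undo_double_utf8_nfc(name_bytes: bytes) -> str:
    """Undo double UTF-8 encoding, assuming the name was NFD-decomposed.

    Hypothetical helper; the NFD assumption is not confirmed by the thread.
    """
    text = name_bytes.decode("utf-8")
    # Recombine 'N' + U+0303 into the single code point U+00D1.
    text = unicodedata.normalize("NFC", text)
    return text.encode("latin-1").decode("utf-8")


# The byte sequence quoted in the comment above.
sample = b"N\xcc\x83\xc2\x8e"
print(undo_double_utf8_nfc(sample))
```

For the quoted bytes this yields a single Cyrillic letter, which would fit the `N` gaps in the partially recovered file name.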
0

I had a very similar problem, judging by enca -L ru broken-file.txt output:

Universal transformation format 8 bits; UTF-8
  Surrounded by/intermixed with non-text data
  Doubly-encoded to UTF-8 from ISO-8859-5

The answer above did not solve the problem, so I tried the following variation:

def decode(contents):
    u = contents.decode("utf-8")
    d = u.encode("raw_unicode_escape")
    return d.decode("cp1251")

# Can be used like:
decode(open('broken-file.txt', "rb").read())  # "rb": read the file as raw bytes

Note that in my case enca reported the wrong source encoding: I replaced ISO-8859-5 with Windows-1251 because the former is barely used anywhere. I also used raw_unicode_escape instead of latin-1; credit goes to the question "Decoding double encoded utf8 in Python".
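
To illustrate the difference between the two codecs, here is a minimal Python 3 sketch. The sample word and the helper name `decode_cp1251_mojibake` are made up for illustration. For code points below U+0100, `raw_unicode_escape` produces the same single bytes as `latin-1`; for anything above that it emits a `\uXXXX` escape instead of raising.

```python
def decode_cp1251_mojibake(contents: bytes) -> str:
    """Same steps as the function above, restated for Python 3."""
    text = contents.decode("utf-8")
    raw = text.encode("raw_unicode_escape")
    return raw.decode("cp1251")


# Simulate the damage: cp1251 bytes misread as Latin-1, then saved as UTF-8.
broken = "привет".encode("cp1251").decode("latin-1").encode("utf-8")
print(decode_cp1251_mojibake(broken))

# latin-1 refuses characters above U+00FF ...
try:
    "€".encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 failed:", exc.reason)

# ... while raw_unicode_escape falls back to a backslash escape.
print("€".encode("raw_unicode_escape"))
```

The escape fallback avoids the exception, but the escaped characters stay in the output as literal `\uXXXX` text, so the result is worth checking by eye.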

Roman Susi
0

I'm not sure this text is salvageable, but as a generic answer, there's a great Python package called ftfy, which attempts to recover malformed text and can explain its processing.

The basic CLI usage looks like this:

$ echo "ноÑÑажнаÑ" | ftfy
ноÑÑажнаÑ
$ echo "ноÑÑажнаÑ" | ftfy -e iso-8859-5
УТНУТОУ'У'УТАУТЖУТНУТАУ'

I've used it with other inputs successfully like this:

$ echo 'Juan CaÃ±as' | ftfy
Juan Cañas

With the Python API, you can get explanations and handle them:

>>> ftfy.fix_and_explain('Juan CaÃ±as')
ExplainedText(text='Juan Cañas', explanation=[('encode', 'sloppy-windows-1252'), ('decode', 'utf-8'), ('normalize', 'NFC')])
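
The steps listed in that explanation can also be replayed by hand with the standard library. This is only a sketch: it uses Python's built-in `cp1252` codec as a stand-in for ftfy's `sloppy-windows-1252`, which is an assumption that holds only for characters that `cp1252` defines, and the function name `apply_plan` is hypothetical.

```python
import unicodedata


def apply_plan(text: str) -> str:
    """Replay encode(windows-1252) -> decode(utf-8) -> normalize(NFC)."""
    raw = text.encode("cp1252")            # stand-in for sloppy-windows-1252
    decoded = raw.decode("utf-8")
    return unicodedata.normalize("NFC", decoded)


# Mojibake form of the name: "ñ" shown as U+00C3 U+00B1.
mojibake = "Juan Ca\u00c3\u00b1as"
print(apply_plan(mojibake))
```

Knowing the plan makes it possible to apply the same fix to many strings without re-running the detection for each one.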
Chris Adams