4

I've been struggling with this problem for a while, but working with encodings is so painful that I have to turn to your smarter minds for some help.

On a trip I made to Ukraine, a friend copied some files with Ukrainian names to my pen drive. However, as you might expect, in the process of copying them to my computer the filenames became unreadable rubbish, such as this:

Ôàíòîì

Well, I have strong reasons to believe that the original filenames were encoded using CP1251 (I know this because I manually checked encoding tables and managed to translate the name of the band correctly). What apparently happened is that, in the process of copying, the CP1251 byte values were kept and the OS now just interprets them as Unicode code points.
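For what it's worth, the suspected chain of events can be reproduced in Python 3. This is a hedged sketch: it assumes the intermediate misreading was CP1252 (a common Western default); swap in latin1 if that is what your OS actually used.

```python
# Sketch of the suspected mojibake chain (Python 3 syntax).
# Assumption: the bytes on the pen drive were CP1251, but were later
# interpreted as CP1252 when copied.
original = u"Фантом"              # the real band name
raw = original.encode('cp1251')   # bytes as written on the pen drive
garbled = raw.decode('cp1252')    # how they were (mis)read afterwards
print(garbled)  # -> Ôàíòîì
```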

I tried to "interpret" the codes in Python with the following script:

print u"Ôàíòîì".decode('cp1251')

It doesn't feel right, though. The result is complete rubbish as well:

Ôàíòîì

If I do:

print repr(u"Ôàíòîì".decode('cp1251'))

I obtain:

u'\u0413\u201d\u0413\xa0\u0413\xad\u0413\u0406\u0413\xae\u0413\xac'

I found out that if I could get all the code points in Unicode and just offset them by 0x350, I would place them in the correct positions for Ukrainian Cyrillic. But I don't know how to do that, and there is probably an answer that is more conceptually correct than this.

Any help would be greatly appreciated!

Edit: Here is an example of the correct translation

Ôàíòîì should translate to Фантом.

Ô 0x00D4 -> Ф 0x0424
à 0x00E0 -> а 0x0430
í 0x00ED -> н 0x043D
ò 0x00F2 -> т 0x0442
î 0x00EE -> о 0x043E
ì 0x00EC -> м 0x043C

As I stated before, there is a 0x350 offset between the correct and wrong code points.
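The offset can be checked programmatically. This short sketch (Python 3 syntax) just confirms the 0x350 difference for each character pair in the table above; the value comes from the fact that latin1 maps byte 0xXX to U+00XX while CP1251 maps bytes 0xC0–0xFF to U+0410–U+044F, and 0x0410 − 0x00C0 = 0x350.

```python
# Verify the 0x350 offset between the garbled and correct names.
wrong = u"Ôàíòîì"
right = u"Фантом"
for w, r in zip(wrong, right):
    offset = ord(r) - ord(w)
    print(hex(ord(w)), '->', hex(ord(r)), hex(offset))
    assert offset == 0x350
```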

(OK, the files are music files... I guess you suspected that...)

Some other test strings (whose translation I don't know): Áåç êîíò›îëfl Äâîº Êàï_òîøêà Ïîäèâèñü

Felipe Ferri
  • 3,488
  • 2
  • 33
  • 48
  • Possible duplicate of http://stackoverflow.com/questions/7555335/how-to-convert-a-string-from-cp1251-to-utf8 – saulspatz Aug 22 '15 at 15:29
  • This is not CP1251; it looks like a multi-byte [Mojibake](https://en.wikipedia.org/wiki/Mojibake) instead; you had UTF-8 or similar and it was decoded wrong. You could see if the [`ftfy` library](http://ftfy.readthedocs.org/en/latest/) can make anything of it. It can't for the sample you gave though. – Martijn Pieters Aug 22 '15 at 15:30
  • Can you share with us the expected value? Then we can try to work backwards to see how the Mojibake was created and reverse the process. And take into account that bytes may have been *dropped* as they don't map to printable characters. Give us the `print repr(value)` output, not the `print value` output, for us to be doing anything meaningful here. – Martijn Pieters Aug 22 '15 at 15:31
  • I added the translation example. I think it's CP1251 because if I get the hex values from the first column in the example and manually look up in a CP1251 charset I can obtain the correct name. – Felipe Ferri Aug 22 '15 at 15:40
  • the following `u'Ôàíòîì'.encode('cp1252').decode('cp1251')` comes out as "Фантом", but the same trick on the other sample has trailing garbage: `u"Áåç êîíò›îëfl".encode("cp1252", 'replace').decode('cp1251')` == "Без конт›ол?" – SingleNegationElimination Aug 22 '15 at 15:50
  • Do you think it's possible to interpret that garbage? I think maybe the original files shared standard ASCII and non-ASCII characters... – Felipe Ferri Aug 22 '15 at 16:19

4 Answers

5
>>> a = u'Ôàíòîì'.encode('8859').decode('cp1251')
>>> print a
Фантом

If you look at the individual characters in your samples, most of them come from the Cyrillic block, but you have others in there from Greek and Coptic and Latin Extended-B, and U+FE52 is a full stop from the back of beyond. So it's a bit of a mess.
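One way to see this for yourself (a small inspection sketch in Python 3 syntax, not part of the original answer) is to print the official Unicode name of each code point in a problematic sample:

```python
# Print the Unicode name of each code point in a garbled sample,
# to see which blocks the characters come from.
import unicodedata

sample = u'Áåç êîíò\u203aîë\ufb02'  # "Áåç êîíò›îëfl", odd chars escaped
for ch in sample:
    print(u'U+%04X %s' % (ord(ch), unicodedata.name(ch, 'UNKNOWN')))
# U+FB02 (the "fl" ligature) has no CP1252 byte at all, which is why
# a plain cp1252 round-trip fails on this sample.
```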
EDIT:

a = u'Ôàíòîì'.encode('cp1252').decode('cp1251')
print a
Фантом
a = u'Äâîº Êàï_òîøêà'.encode('cp1252').decode('cp1251')
print a
Двоє Кап_тошка
a = u'Ïîäèâèñü'.encode('cp1252').decode('cp1251')
print a
Подивись
a = u'Áåç êîíò›îë'.encode('cp1252').decode('cp1251')
print a
Без конт›ол

cp1252 works for the given samples, except for Áåç êîíò›îëfl, where the Latin Small Ligature fl (U+FB02) appears to be superfluous.

Rolf of Saxony
  • 21,661
  • 5
  • 39
  • 60
4

I found out that, besides the filenames, all my files had incorrectly encoded metadata.

I found out that the id3 metadata standard for mp3 files only supports latin1, utf8 and utf16 encodings.

My files all contained CP1251 data that was marked as latin1 in the mp3 tags. Probably in Russia and other Cyrillic-writing countries all music players are set up to interpret latin1 as CP1251, which was not the case for me.

I used Python and mutagen to correct the metadata. When reading the mp3 metadata, mutagen assumed that the data was encoded as latin1, showing garbled characters as a result. What I had to do was take those garbled characters, encode them as latin1 again, and then decode them as CP1251, obtaining proper Unicode. Then I overwrote the mp3 metadata, and mutagen saved the Unicode as UTF-8. With that, all the metadata was correct.

To correct the files metadata I used the following Python script:

import os

from mutagen.easyid3 import EasyID3

def decode_song_metadata(filename):
    id3 = EasyID3(filename)
    for key in id3.valid_keys:
        val = id3.get(key)
        if val:
            print key
            decoded = val[0].encode('latin1').decode('cp1251')
            print decoded
            id3[key] = decoded
    id3.save()

def correct_metadata():
    paths = [u'/Users/felipe/Downloads/Songs']    

    for path in paths:
        # decode_filename is defined further below
        print 'path: ' + decode_filename(path)
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                try:
                    decode_song_metadata(os.path.join(dirpath, filename))
                except Exception:
                    # report files that could not be processed
                    print filename


if __name__ == '__main__':
    correct_metadata()

This corrected the mp3 metadata; correcting the filenames, however, required a different trick, because they had a different encoding problem. What I think happened is that the original filenames were in CP1251, but when they were copied from my FAT32-formatted USB stick to my Mac, macOS interpreted the filenames as latin1. This produced filenames with weird accented characters, which were encoded in UTF-16 in "Normal Form Decomposed", where each accent is saved as a Unicode character separate from the base letter. Also, macOS added a BOM marker that polluted the filename. So in order to correct this I had to do the reverse operation:

  • get the filename. This returns a unicode string with Latin accented characters in Normal Form Decomposed.
  • convert it back to Normal Form Composed.
  • encode it as UTF-16.
  • remove the BOM.
  • decode the bytes as CP1251.

In order to decode the filenames, I used the following script:

import codecs
import unicodedata

def decode_filename(filename):
    # MacOS filenames are stored in Unicode in "Normal Form Decomposed"
    # form, where the accents are saved separated from the main
    # character. Because the original characters weren't proper
    # accentuated letters, in order to recover them we have to decompose
    # the filenames.
    # http://stackoverflow.com/a/16467505/212292
    norm_filename = unicodedata.normalize('NFC', filename)
    utf16 = norm_filename.encode('utf16')
    bom = codecs.BOM_UTF16

    if utf16.startswith(bom):
        # We have to remove the BOM bytes
        utf16 = utf16[len(bom):]

    cp1251 = utf16.decode('cp1251')
    return cp1251

This should be used with the unicode strings returned by os.walk().

Though the above script works, I ended up not using it to correct the filenames. I was using iTunes with the "Auto organize" feature enabled. This was great because every time I played a song in iTunes, it would take the mp3 metadata (which I had already corrected using the first script above) and rename the mp3 file, and even the folder, to match the song name. I find this better than correcting the filenames directly, because it also renames the folders correctly and produces filenames that make sense for the song.

Felipe Ferri
  • 3,488
  • 2
  • 33
  • 48
1

You can add this 0x350 offset like this:

Python 2:

>>> s = u'Ôàíòîì'
>>> decoded = u''.join([unichr(ord(c)+0x350) for c in s])
>>> print decoded
Фантом
1
>>> u'Ôàíòîì'.encode('latin1').decode('cp1251')
'Фантом'
dan04
  • 87,747
  • 23
  • 163
  • 198
  • I like this answer. It works with the 'Фантом' string, but when I move on to the next one it can't convert some characters. For example, this one: 'Áåç êîíò›îëfl' returns an error, but if I remove the characters "›" and "fl" it translates to 'Без контол' (which appears to be correct, minus the removed characters). – Felipe Ferri Aug 22 '15 at 16:06