I've been struggling with this problem for a while but working with encoding is so painful that I have to come to your smarter minds for some help.
On a trip I made to Ukraine, a friend copied some files with Ukrainian names onto my pen drive. However, as you might expect, in the process of copying them to my computer the filenames turned into unreadable rubbish, such as this:
Ôàíòîì
Well, I have strong reasons to believe that the original filenames were encoded using CP1251 (I know this because I manually checked the encoding tables and managed to correctly translate the name of the band). What apparently happened is that, in the process of copying, the CP1251 byte values were preserved and the OS now just interprets them as Unicode code points.
I tried to "interpret" the codes in Python with the following script:
print u"Ôàíòîì".decode('cp1251')
It doesn't feel right though. The result is complete rubbish as well:
Ôà Гòîì
If I do:
print repr(u"Ôàíòîì".decode('cp1251'))
I obtain:
u'\u0413\u201d\u0413\xa0\u0413\xad\u0413\u0406\u0413\xae\u0413\xac'
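As far as I can tell, this repr is what you get when the UTF-8 bytes of the garbled string are decoded as CP1251, so each accented letter turns into two characters. Here is a minimal sketch that reproduces it, assuming Python 3 (my script above is Python 2) and using variable names I made up for illustration:

```python
# Python 3 sketch: reproduce the repr above by decoding UTF-8 bytes as CP1251.
# Variable names are illustrative only.
garbled = "\u00d4\u00e0\u00ed\u00f2\u00ee\u00ec"  # Ôàíòîì, as shown on my computer

utf8_bytes = garbled.encode("utf-8")      # every accented letter becomes two bytes
misdecoded = utf8_bytes.decode("cp1251")  # each byte is then read as one CP1251 character

print(ascii(misdecoded))
```

Every pair starts with \u0413 because the first UTF-8 byte of each of these letters is 0xC3, which CP1251 maps to Г.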
I found out that if I could take all the code points and just offset them by 0x350, they would land in the correct place for Ukrainian Cyrillic. But I don't know how to do that, and there is probably an answer that is more conceptually correct than this.
Any help would be greatly appreciated!
Edit: Here is an example of the correct translation
Ôàíòîì should translate to Фантом.
Ô 0x00D4 -> Ф 0x0424
à 0x00E0 -> а 0x0430
í 0x00ED -> н 0x043D
ò 0x00F2 -> т 0x0442
î 0x00EE -> о 0x043E
ì 0x00EC -> м 0x043C
As I stated before, there is an 0x0350 offset between the correct and wrong code points.
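To make the offset idea concrete, here is a rough sketch of what I have in mind, assuming Python 3. The helper name shift_to_cyrillic is hypothetical, and restricting the shift to U+00C0..U+00FF is my own assumption, based on the six letters in the table above. As a cross-check it also converts each character back to a single byte (Latin-1) and decodes those bytes as CP1251, which may well be the more conceptually correct route:

```python
# Python 3 sketch of the 0x350 offset idea; shift_to_cyrillic is a hypothetical helper.
OFFSET = 0x350

def shift_to_cyrillic(text):
    """Shift code points U+00C0..U+00FF up by 0x350 and leave everything else alone."""
    return "".join(
        chr(ord(ch) + OFFSET) if 0xC0 <= ord(ch) <= 0xFF else ch
        for ch in text
    )

garbled = "\u00d4\u00e0\u00ed\u00f2\u00ee\u00ec"  # Ôàíòîì
shifted = shift_to_cyrillic(garbled)
print(shifted)

# Cross-check: one byte per character via Latin-1, then decode those bytes as CP1251.
redecoded = garbled.encode("latin-1").decode("cp1251")
print(shifted == redecoded)
```

A fixed offset probably won't cover every letter, though: the º in Äâîº falls below U+00C0, so it is left untouched by this sketch.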
(ok, the files are music files... I guess you suspected that...)
Some other test strings (whose translation I don't know): Áåç êîíò›îëfl Äâîº Êàï_òîøêà Ïîäèâèñü