I've been struggling with encoding for a while as I'm building a multilingual database with sqlite3 in Python. So far, I've solved everything, thanks to Google and articles on Stack Overflow. I had problems with Russian, Slovenian, Polish, Spanish, French... but it's all solved, apart from this ONE file I can't fix.

I thought I had found a possible solution on this website: http://www.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. I even found a decoder, which got me reeeally close to solving the problem, but it only produced partially understandable Russian... (I'm sure it can help in other cases though: http://2cyr.com/decode/?lang=fr, and it also exists in English.)

But this last file is gonna be the end of me. Here's the major issue: I KNOW it's Russian because the linguist who gave it to me built it, and knows it's in Russian. BUT, the file itself looks like this:

£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÇÏ    UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÊ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍÕ    UNK £ËÁÀÝÉÊ UNKA

According to my shell, it's encoded in UTF-8. I've therefore been trying to decode it as UTF-8 and encode it into every Russian encoding I could find (ISO-8859-5, koi8_r, koi8_u, cp1252, cp1251...). It never worked. I also tried saving the file in each of these encodings and decoding the other way around, without much success...

It has to go in a database (SQLite), and I know the required encoding for this is UTF-8. The previous Russian file I dealt with was "properly" written (in Cyrillic), and I just had to figure out which encoding to use. But here, I feel like I've tried everything, and I'm just not getting any results...

I'm actually wondering if decoding such a file is even possible, since it's not Cyrillic to start with.

Any suggestions would be welcome :)

Martijn Pieters
Xhattam
  • What does the `repr()` of a line in the file look like? Also, is this Python 2 or 3? In Python 3, you want to open the file in binary mode to inspect the contents without decoding. – Martijn Pieters May 06 '14 at 14:41
  • The `sqlite3` Python database adapter handles Python's `unicode` datatype *just fine*; there is no need to encode the data yourself. Just *decode* from the file (especially if it is UTF-8) and that's it. – Martijn Pieters May 06 '14 at 14:43
  • Thanks, I did that already but I get this error: `UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)` – Xhattam May 06 '14 at 15:12
  • This is a repr of a line with no decode/encode: '\xc2\xa3\xc3\x8b\xc3\x81\xc3\x94\xc3\x98' – Xhattam May 06 '14 at 15:14
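The point made in the comments above, that `sqlite3` accepts Unicode text directly, can be sketched as follows. This assumes Python 3, where `str` plays the role of Python 2's `unicode`. The in-memory database and the `words` table with its `form` and `lemma` columns are hypothetical names chosen for illustration only.

```python
import sqlite3

# Hypothetical in-memory database and table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (form TEXT, lemma TEXT)")

# Pass already-decoded text as parameters; no manual .encode() is needed.
conn.execute("INSERT INTO words VALUES (?, ?)", ("ёкающее", "ёкающий"))

# Text columns come back as str objects.
row = conn.execute("SELECT form, lemma FROM words").fetchone()
print(row)
conn.close()
```

Using `?` placeholders leaves the conversion between Python text and the database's storage encoding to the adapter, so the only decoding step that matters is the one applied when reading the source file.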

1 Answer

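The main problem is that the underlying text is not really UTF-8 text; it is KOI8-R. Judging by the repr posted in the comments, the KOI8-R bytes were read as Latin-1 and then saved as UTF-8, which is why the shell reports UTF-8. If you need to decode it in Python, you may refer to this answer - string encode / decode - which might give you some clues.

I have decoded the text you posted - enjoy: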

ёкающее UNK ёкающий UNKA
ёкающего    UNK ёкающий UNKA
ёкающей UNK ёкающий UNKA
ёкающем UNK ёкающий UNKA
ёкающему    UNK ёкающий UNKA
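A minimal sketch of how this decoding could be reproduced in Python 3. It assumes the intermediate mis-reading was Latin-1, which is consistent with the repr posted in the comments. If the file had instead passed through cp1252, some bytes in the 0x80-0x9F range might need different handling. The helper name `fix_koi8r_mojibake` is hypothetical.

```python
def fix_koi8r_mojibake(text: str) -> str:
    """Undo KOI8-R bytes that were mis-decoded as Latin-1 (assumed)."""
    # Latin-1 maps code points 0-255 straight back to the same byte values,
    # so encoding with it recovers the original KOI8-R bytes.
    raw = text.encode("latin-1")
    return raw.decode("koi8_r")


# Bytes of one line of the file, as given by the repr in the comments.
line = b"\xc2\xa3\xc3\x8b\xc3\x81\xc3\x94\xc3\x98"

garbled = line.decode("utf-8")        # the file really is valid UTF-8
fixed = fix_koi8r_mojibake(garbled)   # but the text inside it is KOI8-R
print(fixed)  # ёкать
```

For a whole file, the same function could be applied line by line after opening the file with `encoding="utf-8"`. The resulting strings are ordinary Unicode text and need no further encoding before being stored.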
Andy W
  • Thanks for this, but I came across this possible decoding already, and after checking the meaning of many decoded words, they don't mean anything at all in French... I'm thinking it's a decoding that looks close to Russian but isn't actual Russian... I'll have a look at the page you gave though, thanks again! :) – Xhattam May 06 '14 at 15:10
  • Yes, I have tried this one too, even the koi8_u codec, but still no Cyrillic representation. My shell tells me the file is in UTF-8, but you think it's koi8_r? (I did read it's hard to "guess" an encoding from a file if you don't know it from the start...). I think there's a way, but it's just about finding the perfect decode/encode sequence (still working on it) :) – Xhattam May 06 '14 at 15:30
  • @Xhattam I'm a native Russian speaker and the decoded text makes good sense to me (otherwise I wouldn't have posted it). Those are different cases and genders of a word that translates as "the one who uses/says the YO sound" - not sure why you would need it, but anyway, that is what you have got. – Andy W May 06 '14 at 19:28
  • It does? OK then, I'll trust you! It's just that, as a non-speaker building a multilingual database, I have to rely on Google Translate to check that the language is the one I'm looking for, and it didn't make sense in French, but I suspect that might be because it "translated" the Cyrillic characters into Latin-readable ones, maybe... but thanks anyway, I'll go with that! Thanks Andy! :) – Xhattam May 07 '14 at 07:48