Check encoding and convert to Unicode

Question

I have a list of strings in different encodings: some are cp1251, some are ASCII, and some are something else. I need to convert them all to Unicode, because I get a UnicodeDecodeError, especially when I try to dump the data to JSON.

How can I do this?

Do you have any indication of the encoding? Guessing the encoding is possible but is going to be imprecise. — Martijn Pieters, Feb 13 '13 at 15:27
Could you include a few examples of the input strings? Also, are you using Python 2.x or 3.x? — Jon-Eric, Feb 13 '13 at 15:27
ASCII is a subset of cp1251 (and just about every other encoding), so that's one part of your problem that isn't a problem. How do you know that some of the strings are in cp1251? If you are getting Russian data, "something else" could be koi8r. Or it could be UTF-8. Provide examples. Also tell us how you obtain a list of strings with different encodings. — John Machin, Feb 15 '13 at 11:26
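To illustrate the point about ASCII being a subset, here is a minimal Python 3 sketch (the thread itself uses Python 2; the sample strings are made up for illustration). Pure-ASCII bytes decode to the same text under every codec listed, while cp1251 bytes decoded as koi8_r raise no error and silently produce the wrong letters.

```python
# Python 3 sketch; the sample strings are illustrative assumptions.
ascii_bytes = b"what?"

# Pure ASCII decodes identically under all of these codecs.
decoded = {enc: ascii_bytes.decode(enc)
           for enc in ("ascii", "cp1251", "koi8_r", "utf-8")}
all_same = len(set(decoded.values())) == 1

russian = "привет"
cp1251_bytes = russian.encode("cp1251")

right = cp1251_bytes.decode("cp1251")   # the codec the bytes were written in
wrong = cp1251_bytes.decode("koi8_r")   # no exception, but different letters

print(all_same, right == russian, wrong == russian)
```

This is why a wrong guess can go unnoticed: most single-byte codecs accept almost any byte, so a failed decode is the exception rather than the rule.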

Zero Piraeus · Accepted Answer · 2013-02-13T18:08:54.743

4

You can use chardet to detect the encoding of a string, so one way to convert a list of them to unicode (in Python 2.x) would be:

import chardet

def unicodify(seq, min_confidence=0.5):
    result = []
    for text in seq:
        guess = chardet.detect(text)
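        if guess["encoding"] is None or guess["confidence"] < min_confidence:
            # chardet found no encoding, or isn't confident enough in its guess.
            # A bare UnicodeDecodeError can't be raised (it needs five arguments).
            raise ValueError("cannot reliably detect encoding of %r" % (text,))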
        decoded = text.decode(guess["encoding"])
        result.append(decoded)
    return result

... which you'd use like this:

>>> unicodify(["¿qué?", "什么？", "what?"])
[u'\xbfqu\xe9?', u'\u4ec0\u4e48\uff1f', u'what?']

CAVEAT: Solutions like chardet should only be used as a last resort (for instance, when repairing a dataset that's corrupt because of past mistakes). It's far too fragile to be relied on in production code; instead, as @bames53 points out in the comments to this answer, you should fix the code that corrupted the data in the first place.
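The fix the caveat points to is to decode at the point where the encoding is still known, so nothing downstream has to guess. A minimal Python 3 sketch, assuming each payload arrives together with the encoding its source declared (the `incoming` records are hypothetical):

```python
import json

# Hypothetical input: raw bytes paired with the encoding the source declared.
incoming = [
    (b"\xbfqu\xe9?", "latin-1"),
    ("привет".encode("cp1251"), "cp1251"),
    (b"what?", "ascii"),
]

# Decode once, at the boundary; from here on everything is text.
texts = [raw.decode(encoding) for raw, encoding in incoming]

# Text serialises to JSON without any encoding errors.
payload = json.dumps(texts, ensure_ascii=False)
print(payload)
```

Once everything is decoded on the way in, the JSON dump from the question has no undecoded bytes left to trip over.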

edited Feb 13 '13 at 18:08

answered Feb 13 '13 at 15:48

Zero Piraeus

Thanks! This is the best solution to this problem I've ever seen! – Andrii Rusanov Feb 13 '13 at 16:34
Guessing at the encoding is not a good solution and should be avoided if at all possible. – bames53 Feb 13 '13 at 16:46
@bames53 My reading of OP's question is that guessing *is* necessary, though. – Zero Piraeus Feb 13 '13 at 16:51
Right, as the OP currently defines the problem; he has strings for which he has no idea of their encoding. As long as he doesn't change that then guessing is necessary, but the real solution would be for him to take a step back and fix the real problem, which is that he has strings for which he has no idea of their encoding. @user8289 – bames53 Feb 13 '13 at 17:12
@bames53 That's not always possible ... "Hello, Mrs. Peñaranda? Hi, it's Dave here from FooCorp. You ordered a MultiWidget in August of 2007, and I'd just like to check what default encoding you had set in your browser at the time. Mrs. Peñaranda? Hello?" – Zero Piraeus Feb 13 '13 at 17:16
You don't know that it's not possible in the OP's case though. For whatever reason many programmers are ignorant on the topic of encodings and simply don't realize that, although guessing may seem to sort of hide the problem, they could actually fix it reliably if they just didn't create the corrupted data (yes, strings in an unknown encoding are _corrupt_) in the first place. Since they may not know enough to ask, it's a good idea to discuss this whenever the topic of guessing comes up. – bames53 Feb 13 '13 at 17:47
Secondly, in the case you describe guessing may be acceptable for a once-off, manual fix of the corrupted DB records, but guessing should absolutely not be built into the web app; the web app should be fixed to stop producing corrupt data. – bames53 Feb 13 '13 at 17:48

score 0 · Answer 2 · answered Feb 13 '13 at 15:33

If you know the encoding, it should be pretty easy:

unicode_string = encoded_string.decode(encoding)

If you don't know the encoding, detecting it may be hard; how hard depends on the encodings and languages you expect.
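When the set of expected encodings is small, one common approach is to try them in order and keep the first that decodes cleanly. The sketch below is Python 3, and both the helper name `decode_with_candidates` and the candidate list are assumptions for illustration. Order matters: strict UTF-8 goes first because it rejects most non-UTF-8 input, whereas cp1251 accepts nearly any byte and codecs such as koi8_r or latin-1 accept every byte.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1251")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate list is an assumption about the data,
    and a clean decode does not prove the guess was right.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes; try the next one
    raise ValueError("none of the candidate encodings fit: %r" % (raw,))


for sample in ("привет".encode("utf-8"), "привет".encode("cp1251"), b"what?"):
    print(decode_with_candidates(sample))
```

This is still guessing, only with a narrower search space, so it suits a one-off clean-up better than a permanent code path.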

answered Feb 13 '13 at 15:33

Michal Čihař

feedMe · Answer 3 · 2018-09-03T10:10:57.203

0

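Try using the unicode function to convert the string to the built-in unicode type. Note that in Python 2, unicode(s) with no encoding argument decodes with the default ASCII codec, so it only works for pure-ASCII byte strings; for anything else, pass the encoding explicitly, e.g. unicode(s, "cp1251").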

>>> s = "Some string"
>>> s = unicode(s)
>>> type(s)
<type 'unicode'>

For your problem try this to create a new list of unicode strings.

new = []
for item in myList:
    new.append(unicode(item))

or, using a list comprehension:

new = [unicode(item) for item in myList]

Read the official Python Unicode HOWTO.
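A minimal Python 3 sketch of what happens when the conversion above meets non-ASCII input, and of two alternatives. `raw.decode("ascii")` stands in for Python 2's `unicode(raw)`, and the sample bytes are an assumption for illustration.

```python
raw = "привет".encode("cp1251")   # sample non-ASCII byte string

# Decoding as ASCII fails on the first byte >= 0x80.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the real encoding recovers the text.
text = raw.decode("cp1251")

# errors="replace" never raises, but each undecodable byte becomes U+FFFD.
lossy = raw.decode("ascii", errors="replace")

print(ascii_ok, text, len(lossy))
```

The `errors="replace"` handler makes the exception go away at the cost of the data, so it only helps when losing the original characters is acceptable.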

edited Sep 03 '18 at 10:10

answered Feb 13 '13 at 15:49

feedMe
