
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.

I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.

cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)

But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:

"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
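The error comes from mixing a byte `str` with a Unicode replacement, which makes Python 2 implicitly decode the bytes as ASCII. One possible way around it is to do the replacement at the byte level and only decode afterwards; a sketch in Python 3 syntax, reusing the mapping above:

```python
# Replace the cp1252 bytes with the UTF-8 encoding of their Unicode
# counterparts *before* decoding, so the result is valid UTF-8.
cp1252_to_unicode = {
    b"\x85": "\u2026",  # …
    b"\x91": "\u2018",  # ‘
    b"\x92": "\u2019",  # ’
    b"\x93": "\u201c",  # “
    b"\x94": "\u201d",  # ”
    b"\x97": "\u2014",  # —
}

def fix_line(raw: bytes) -> str:
    for bad, good in cp1252_to_unicode.items():
        raw = raw.replace(bad, good.encode("utf-8"))
    return raw.decode("utf-8")

print(fix_line(b"one\x85two\x97three"))  # one…two—three
```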

Any ideas for how to deal with this?

Keith Hughitt
    Doubt it will fix your issue, but [``str.translate()``](http://docs.python.org/library/stdtypes.html#str.translate) is far better suited to what you are trying to do than a bunch of replaces. e.g: ``cp1252_to_unicode = string.maketrans({...})`` then ``l.translate(cp1252_to_unicode)``. – Gareth Latty Apr 04 '12 at 11:05
  • It is very hard to believe that only those Windowsy punctuation characters were originally cp1252... are you aware of how the mixup happened? Are you sure that your UTF-8-encoded characters decode into *meaningful* Unicode? What language is the text written in? – John Machin Apr 04 '12 at 11:48
  • Unfortunately I don't have too much information about how the files became corrupted in the first place. The files are written in English and were probably not originally encoded as Unicode, but simply as Ascii (99% of the text is plain Ascii). I am guessing that someone working on Windows inserted the characters (em dash, etc) either using an editor that did so for them or using the alt- shortcuts. I looked up the Unicode characters manually, so those should work if they are used as replacements and the file read out as Unicode. – Keith Hughitt Apr 05 '12 at 09:54

5 Answers


If you try to decode this string as UTF-8, as you already know, you will get a UnicodeDecodeError, as these spurious cp1252 characters are invalid UTF-8 -

However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function - it gets the UnicodeDecodeError as a parameter - so you can write such a handler that attempts to decode the data as "cp1252", and continues the decoding in UTF-8 for the rest of the string.

In my utf-8 terminal, I can build a mixed incorrect string like this:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma�� 
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if that next character is also not UTF-8 and out of range(128), the error is raised again at the first out-of-range(128) character - that means, the decoding "walks back" if consecutive non-ASCII, non-UTF-8 chars are found.

The workaround for this is to have a state variable in the error handler which detects this "walking back" and resumes decoding from the last call to it - in this short example, I implemented it as a global variable (it will have to be manually reset to -1 before each call to the decoder):

import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error.object  # the str being decoded
    position = unicode_error.start
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)

And on the console:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã 
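For anyone on Python 3: subscripting the exception (`unicode_error[1]`) raises a TypeError there (see the comments below), and the global-variable workaround is no longer needed, since the decoder honours the returned resume position. A sketch of the same handler using the exception's `.object` and `.start` attributes:

```python
import codecs

def mixed_decoder(err: UnicodeDecodeError):
    # Decode the single offending byte as cp1252 and resume
    # UTF-8 decoding right after it.
    bad_byte = err.object[err.start:err.start + 1]
    return bad_byte.decode("cp1252"), err.start + 1

codecs.register_error("mixed", mixed_decoder)

a = "maçã ".encode("utf-8") + "maçã ".encode("cp1252")
print(a.decode("utf-8", "mixed"))  # maçã maçã
```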
jsbueno
  • Good answer, but including an example would make it even better. – Duncan Apr 04 '12 at 11:25
  • @Duncan: I was working on the example - took me some time due to the catch mentioned above. – jsbueno Apr 04 '12 at 11:43
  • I wonder if you're the first person to actually try to write this code? It sounds like a bug. Sorry I can't give you a second up vote for the example code. – Duncan Apr 04 '12 at 15:18
  • I think it is not a bug - it reraises the error from the first char in the chain it does not recognize as utf-8. – jsbueno Apr 04 '12 at 18:19
  • Thanks for the solution, jsbueno! I'm surprised how complicated it is to simply replace the characters when you have more than one in a string. The above approach *almost* gets it, but it ends up returning multiple copies of some parts of a string, e.g., for "one\x85two\x97three\x92four", the result is "one…twotwowoo—threethreehreereeeee’four"... any ideas? – Keith Hughitt Apr 05 '12 at 10:49
  • 1
    Hi Keith - sorry I had not tested for all cases - the item [2] in the tuple is not the error sart as I first thought. But I found the "unicode error" object to have an "start" attribute that is the number I was expecting to be - try it now. It certainly does have room for improvement though – jsbueno Apr 05 '12 at 22:00
  • As for the "complication", attribute it to the fact that this approach has nothing in common with the replacements you had thought of - we are actually implementing a fallback parser for the encoding with this. – jsbueno Apr 05 '12 at 22:01
  • Great! It looks like you can actually get away with having to use the global variable using those params, e.g.: def cp1252_decoder(unicode_error): start = unicode_error.start end = unicode_error.end return unicode_error.object[start:end].decode("cp1252"), end Or do you think it is still necessary?.. Thanks for all of your help. – Keith Hughitt Apr 06 '12 at 09:57
  • I haven't checked if the "end" parameter is always correct - but if it is, and you adjust the "position" return parameter properly, that may be possible, and would be a better solution than the one in my example. – jsbueno Apr 06 '12 at 13:18
  • I'm getting the following error: `TypeError: 'UnicodeDecodeError' object is not subscriptable` stepping through the code it looks to be thrown on the second line of `mixed_decoder` any clue as to how to get it working? – amadib Mar 12 '13 at 20:27
  • In many cases it could be a good idea to set pythons default encoding to utf-8 while parsing streams `reload(sys).setdefaultencoding('utf-8')`. You have to `import sys` for that to work. – Sprinterfreak Oct 13 '17 at 06:08

With thanks to jsbueno, a whack of other Google searches, and other pounding, I solved it this way.

#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")

This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.

import codecs    
replacement = {
   '85' : '...',           # u'\u2026' ... character.
   '96' : '-',             # u'\u2013' en-dash
   '97' : '-',             # u'\u2014' em-dash
   '91' : "'",             # u'\u2018' left single quote
   '92' : "'",             # u'\u2019' right single quote
   '93' : '"',             # u'\u201C' left double quote
   '94' : '"',             # u'\u201D' right double quote
   '95' : "*"              # u'\u2022' bullet
}

#This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError.object
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition   # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")

Basically I attempt to turn it into UTF-8. For any characters that fail I just convert them to hex so I can display or look them up in a table of my own.

This is not pretty, but it does allow me to make sense of messed-up data.
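The code above is Python 2 only, because `str.encode('hex')` is gone in Python 3. A rough Python 3 port of the same handler, assuming `bytes.hex()` (available since 3.5) as the replacement, and registered under a hypothetical name of its own so it does not clobber the handlers above:

```python
import codecs

# Same cp1252-byte -> plain-ASCII fallback table as above.
replacement = {
    '85': '...', '91': "'", '92': "'", '93': '"',
    '94': '"', '95': '*', '96': '-', '97': '-',
}

def hex_decoder(err: UnicodeDecodeError):
    err_hex = err.object[err.start:err.end].hex()
    # Fall back to the hex digits themselves for unknown bytes.
    return replacement.get(err_hex, err_hex), err.end

codecs.register_error("mixed_hex", hex_decoder)

print(b"one\x85two\x97three".decode("utf-8", "mixed_hex"))  # one...two-three
```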

AnthonyVO

Good solution from @jsbueno, but there is no need for the global variable last_position, see:

def mixed_decoder(error: UnicodeError) -> (str, int):
    bs: bytes = error.object[error.start:error.end]
    return bs.decode("cp1252"), error.end

import codecs
codecs.register_error("mixed", mixed_decoder)

a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"

s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
Asaga

This is usually called Mojibake.

There's a nice Python library that might solve these issues for you called ftfy.

Example:

>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
Avamander

Just ran into this today, so here is my problem and my own solution:

original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

def mixed_decoding(s):
    output = ''
    ii = 0
    while ii < len(s):
        # Expand literal "\xNN" escape sequences left in the text.
        if s[ii] == '\\' and ii + 1 < len(s) and s[ii + 1] == 'x':
            output += s[ii:ii + 4].encode('ascii').decode('unicode-escape')
            ii += 4
        else:
            output += s[ii]
            ii += 1
    print(output)
    return output

decoded_string = mixed_decoding(original_string)

Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
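If the escapes really are literal `\xNN` text in an otherwise plain-ASCII string, the loop can likely be reduced to a single round-trip through Python's `unicode_escape` codec. A sketch; note that `unicode_escape` maps `\xNN` to Latin-1 code points, which happens to coincide with cp1252 for these accented letters:

```python
original_string = r'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'

# Encode to ASCII bytes, then let the unicode_escape codec expand
# the literal \xNN sequences into their corresponding characters.
decoded = original_string.encode('ascii').decode('unicode_escape')
print(decoded)  # Notificação de Emissão de Nota Fiscal Eletrônica.
```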

Julio S.