
As far as I know, Python is designed so that strings contain only valid characters, but in my case the OS delivers path names with invalid encodings that I have to deal with. So I end up with strings that contain characters that are not valid Unicode.

In order to correct these problems I need to display these strings somehow. Unfortunately I cannot print them because they contain non-Unicode characters. Is there an elegant way to replace these characters so that I at least get some idea of the content of the string?

My idea would be to process these strings character by character and check whether each character is actually valid Unicode; an invalid character would be replaced by a certain Unicode symbol. But how can I do this? Using codecs does not seem suitable for this purpose: I already have a string, returned by the operating system, not a byte array, and converting the string to a byte array involves encoding, which of course fails in my case. So it seems that I'm stuck.
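
To make this more concrete, something along the lines of the following sketch is what I have in mind, although I am not sure it is the right way to test each character (it assumes Python 3, where the OS stores undecodable bytes as lone surrogates via the surrogateescape error handler; the function name is just a placeholder):

    def printable(s, marker='\N{REPLACEMENT CHARACTER}'):
        out = []
        for ch in s:
            try:
                ch.encode('utf-8')        # valid code points encode fine
                out.append(ch)
            except UnicodeEncodeError:    # lone surrogates cannot be encoded
                out.append(marker)
        return ''.join(out)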

Do you have any tips for me on how to create such a replacement string?

Regis May

4 Answers


If you have a bytestring (undecoded data), use the 'replace' error handler. For example, if your data is (mostly) UTF-8 encoded, then you could use:

decoded_unicode = bytestring.decode('utf-8', 'replace')

and U+FFFD � REPLACEMENT CHARACTER characters will be inserted for any bytes that can't be decoded.

If you wanted to use a different replacement character, it is easy enough to replace these afterwards:

decoded_unicode = decoded_unicode.replace('\ufffd', '#')

Demo:

>>> bytestring = b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'
>>> bytestring.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte
>>> bytestring.decode('utf8', 'replace')
'Føö�Bår'
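
If what you have is not bytes but a Python 3 str returned by the OS (where undecodable bytes show up as surrogates produced by the surrogateescape error handler), one way to get back to bytes could be os.fsencode(), after which the same trick applies. A rough sketch (the function name is arbitrary):

    import os

    def displayable(path_str):
        # os.fsencode() turns surrogate-escaped characters back into the
        # original raw bytes; decoding with 'replace' then substitutes
        # U+FFFD for anything that is not valid UTF-8.
        return os.fsencode(path_str).decode('utf-8', 'replace')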
Martijn Pieters
  • Nice! I didn't see that in the documentation: I wish such features were described more prominently. – Regis May Jul 25 '16 at 12:17

Thanks for your comments. With their help I was able to implement a better solution: a small helper that reports whether the string is valid and returns a printable version of it:

    import codecs

    def to_printable(s):
        """Return (ok, printable_string, error) for a possibly invalid string s."""
        try:
            codecs.encode(s, "utf-8")   # fails if s contains invalid (surrogate) characters
            return (True, s, None)
        except Exception as e:
            # 'replace' turns unencodable characters into '?'
            ret = codecs.decode(codecs.encode(s, "utf-8", "replace"), "utf-8")
            return (False, ret, e)
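
For example, calling this helper on a string that contains one surrogate-escaped byte (constructed artificially here for demonstration) gives:

    >>> s = b'F\xc3\xb8\xbb'.decode('utf-8', 'surrogateescape')
    >>> ok, text, err = to_printable(s)
    >>> ok, text
    (False, 'Fø?')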

Please share any improvements on that solution. Thank you!

Regis May

You have not given an example, so I have made one up in order to answer your question.

x='This is a cat which looks good 😊'
print x
x.replace('😊','')

The output is:

This is a cat which looks good 😊
'This is a cat which looks good '
Chandan
  • I have no concrete example of the actual byte pattern that led to errors in my case. The filter I am writing is intended to identify Unicode encoding problems in given strings. One way you might be able to reproduce the situation is to simply generate random data and then try to interpret it as UTF-8. You will typically fail, because such binary data will very likely violate the UTF-8 standard. I'm sorry, I cannot identify how these violations occurred; I was trying to detect their existence as a first step. – Regis May Jul 25 '16 at 10:24
  • You can try this if you don't know the non-unicode characters: try: string.decode('utf-8') print "string is UTF-8, length %d bytes" % len(string) except UnicodeError: print "string is not UTF-8" – Chandan Jul 25 '16 at 10:26
  • Yes, but the interesting part starts where I would not only like to identify whether a string is valid Unicode, but to actually get some idea about the string itself by filtering or replacing the 'characters' that are invalid. – Regis May Jul 25 '16 at 10:33
  • if you can identify the invalid word, then you just replace that word. – Chandan Jul 25 '16 at 10:36

The right way to do it (at least in Python 2) is to use unicodedata.normalize:

import unicodedata

unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

decode('utf-8', 'ignore') will just raise an exception.

rubmz