I'm loading some data, processing it, and then sending it to an application which (fair enough) doesn't allow the invalid UTF-8 noncharacters U+FDD0 through U+FDEF, as well as the invalid special characters U+FFFE and U+FFFF.
My raw data is out of my control, and some of it happens to contain invalid characters that I want to clean out.
However, my Python code is still sending the application invalid UTF-8, because decoding doesn't ignore the noncharacters and other invalid characters.
For example, b'\xef\xbf\xbf'.decode('utf-8', 'ignore') returns '\uffff' instead of ignoring the invalid character, and encode() has the same behaviour.
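A slightly fuller reproduction, as a minimal sketch: the input bytes are made up for illustration and combine U+FFFF, U+FDD0 and one stray 0xFF byte. It suggests that 'ignore' only drops bytes that are not well-formed UTF-8, while the noncharacters pass through in both directions.

```python
# Illustrative input (not from my real data): "ok", U+FFFF, U+FDD0,
# then a stray 0xFF byte that can never appear in well-formed UTF-8.
raw = b'ok\xef\xbf\xbf\xef\xb7\x90\xff'

# 'ignore' drops the stray 0xFF byte but keeps both noncharacters.
decoded = raw.decode('utf-8', 'ignore')
print(ascii(decoded))

# Encoding with 'ignore' writes the noncharacters back out unchanged.
encoded = decoded.encode('utf-8', 'ignore')
print(encoded)
```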
I first debugged this with U+FFFE, which has a wontfix bug related to the BOM (https://bugs.python.org/issue765036).
Then I found this massive email list thread (https://bugs.python.org/issue12729) claiming that it's ok to emit noncharacters because applications may want to keep them for internal use.
However, is there any nice Python way to emit 'transmittable' UTF-8 without these noncharacters and other invalid characters like U+FFFF?
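One possible direction, as a rough sketch and not a built-in codec feature: decode with 'ignore' to drop malformed bytes, then strip the noncharacters explicitly before re-encoding. The helper names strip_noncharacters and to_transmittable_utf8 are hypothetical, and the sketch assumes "noncharacters" means the 66 code points Unicode designates as such (U+FDD0 through U+FDEF plus the last two code points of each of the 17 planes); the receiving application may reject a different set.

```python
import re

# Assumption: the set to remove is the 66 Unicode noncharacters, i.e.
# U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF for every plane n in 0..16.
_NONCHARACTER_RE = re.compile(
    '[\ufdd0-\ufdef'
    + ''.join(
        chr(plane * 0x10000 + 0xFFFE) + chr(plane * 0x10000 + 0xFFFF)
        for plane in range(17)
    )
    + ']'
)


def strip_noncharacters(text: str) -> str:
    """Hypothetical helper: remove every Unicode noncharacter from text."""
    return _NONCHARACTER_RE.sub('', text)


def to_transmittable_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: drop malformed bytes, then drop noncharacters."""
    text = raw.decode('utf-8', 'ignore')  # skips bytes that are not valid UTF-8
    return strip_noncharacters(text).encode('utf-8')


print(to_transmittable_utf8(b'ok\xef\xbf\xbf\xef\xb7\x90\xff!'))
```

The pattern is compiled once at import time, so repeated calls only pay for the substitution. U+FFFD (the replacement character) is not a noncharacter and is left in place.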