The XML specification lists a bunch of Unicode characters that are either illegal or "discouraged". Given a string, how can I remove all illegal characters from it?
I came up with the following regular expression, but it's a bit of a mouthful.
illegal_xml_re = re.compile(u'[\x00-\x08\x0b-\x1f\x7f-\x84\x86-\x9f\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')
clean = illegal_xml_re.sub('', dirty)
(Python 2.5 doesn't know about Unicode chars above 0xFFFF, so no need to filter those.)