I'm parsing an XML file with SAX in Python. The XML is read from an HTTP stream via urllib.request.
However, the XML stream seems to contain invalid characters. Specifically, when I decode it as UTF-8 and dump it to a file, I see many instances of '8000', each preceded and followed by a line break. This causes SAX parsing to fail.
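To illustrate the failure, here is a minimal sketch that uses hard-coded bytes in place of the real HTTP response, which I cannot share. The names `CountingHandler` and `try_parse` are made up for this example, and the sample document is a stand-in for the real data.

```python
import io
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Hypothetical handler that only counts start tags."""

    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        self.elements += 1


def try_parse(data: bytes):
    """Return (element_count, None) on success or (None, message) on failure."""
    handler = CountingHandler()
    try:
        # xml.sax.parse accepts a file-like object as its source.
        xml.sax.parse(io.BytesIO(data), handler)
    except xml.sax.SAXParseException as exc:
        return None, exc.getMessage()
    return handler.elements, None


# Stand-in payloads: one clean, one with the stray '8000' line in front.
clean = b"<?xml version='1.0'?><root/>"
dirty = b"8000\r\n" + clean

print(try_parse(clean))
print(try_parse(dirty))
```

The clean payload parses, while the one prefixed with '8000' raises a SAXParseException, which matches what I see with the real stream.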
My question is twofold:
- How can I remove or ignore invalid characters as they arrive in a urllib.request data stream?
- What is '8000' likely to be, and is there a more specific fix for that issue?
[edit]
I cannot share the source data, but these are the first few characters, as a string and as hex. The leading "8000" is the offending sequence.
String:
8000<?xml
Hex:
38:30:30:30:3c:3f:78:6d:6c:20
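For reference, a dump in this format can be produced roughly as follows. This is a sketch: `hex_dump` is a hypothetical helper, the sample bytes are hard-coded rather than read from the real response, and the separator argument to `bytes.hex` requires Python 3.8 or later.

```python
def hex_dump(data: bytes, count: int = 10) -> str:
    """Return the first `count` bytes as colon-separated hex pairs."""
    return data[:count].hex(":")


# Hard-coded stand-in for the start of the response body.
sample = b"8000<?xml version='1.0'?>"
print(hex_dump(sample))  # 38:30:30:30:3c:3f:78:6d:6c:20
```

The bytes 38:30:30:30 are the ASCII characters '8', '0', '0', '0', so the prefix is plain text rather than a single unusual code point.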
The '8000' string could be removed with a search and replace, but that's not a nice solution, since the data itself may legitimately contain that fairly common string.