Removing invalid characters from XML stream

Question

I'm parsing an XML file with SAX in Python. The XML is read from an HTTP stream via urllib.request.

However, the XML stream seems to contain invalid characters. Specifically, when I decode it as UTF-8 and dump it to a file, I get many instances of '8000', each preceded and followed by a line break. This causes SAX parsing to fail.

My question is twofold:

How can I remove or ignore invalid characters as they arrive in a urllib.request data stream?
What is '8000' likely to be, and is there a more specific fix for that issue?

[edit]

I cannot share the source data, but these are the first few characters as a string and as hex. The leading characters are the offending "8000" string.

String:

8000<?xml

Hex:

38:30:30:30:3c:3f:78:6d:6c:20

The '8000' string could be removed with search and replace, but that's not a good solution, since the data itself may legitimately contain that fairly common string.

Are you sure it comes in UTF-8? Can you provide a link to the raw data or present a hexdump? — forty-two, Oct 18 '18 at 10:29
Well, not 1000 % sure I suppose. I will update the question with a hexdump. — hexamon, Oct 18 '18 at 10:55
I added my own answer below. It seems 8000 was the port number that was written via the HTTPResponse object for some reason. — hexamon, Oct 19 '18 at 10:49

kjhughes · Answer 1 · 2018-10-18T20:29:13.733

<?xml is the beginning of an XML declaration.

There can only be at most one XML declaration in an XML document, and it may only appear as the very first thing in the file. For "8000" to precede it renders the XML document not well-formed. Before trying to parse this stream as XML, you'll have to ensure that no more than one XML declaration exists and nothing precedes it. This has to be done at the character/string/text level – not at the XML level.

See also Error: The processing instruction target matching "[xX][mM][lL]" is not allowed
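As a rough illustration of the text-level check described above, the sketch below looks at the raw bytes before any XML parsing and reports what precedes the first XML declaration and how many declarations there are. The function name inspect_prolog is made up for this example, and it assumes an ASCII-compatible encoding such as UTF-8.

```python
import re

# Matches the start of an XML declaration ("<?xml" followed by whitespace),
# so processing instructions such as "<?xml-stylesheet" are not counted.
XML_DECL = re.compile(rb"<\?xml\s")


def inspect_prolog(data):
    """Return (prefix, count) for a bytes object.

    prefix: the bytes found before the first XML declaration (b"" if none).
    count:  how many XML declarations appear in the data.
    Hypothetical helper for illustration; assumes an ASCII-compatible encoding.
    """
    matches = list(XML_DECL.finditer(data))
    prefix = data[:matches[0].start()] if matches else b""
    return prefix, len(matches)


# The hex dump from the question, with the colons removed.
sample = bytes.fromhex("38:30:30:30:3c:3f:78:6d:6c:20".replace(":", ""))
prefix, count = inspect_prolog(sample)
print(prefix, count)
```

A non-empty prefix (other than a byte order mark) or a count above one means the stream is not well-formed as it stands and needs fixing at the byte level first.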

score 0 · Accepted Answer · answered Oct 19 '18 at 10:42

It seems that the code fed the XML parser the underlying file object of the HTTPResponse (i.e. HTTPResponse.fp) returned by urllib.request.urlopen, instead of the HTTPResponse itself. For some reason, this caused the port (8000) to be written in each buffered chunk of the BufferedReader. The issue appears to have been introduced when migrating from Python 2 to 3 (perhaps the HTTPResponse object behaved differently in Python 2).

Feeding the XML parser the HTTPResponse directly instead of response.fp removed the port from the byte stream, and no further encoding issues appeared.
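A minimal sketch of the fix, not the original code: the handler class, the count_elements helper and the URL in the comment are all made up for illustration. The point is only that xml.sax.parse accepts any file-like object, so the HTTPResponse can be passed as is. An in-memory stream stands in for the response so the example runs offline.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that just counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(source):
    """Parse a file-like object with SAX and return the number of elements."""
    handler = ElementCounter()
    xml.sax.parse(source, handler)
    return handler.count


# Offline stand-in for the HTTP response body.
demo = io.BytesIO(b"<?xml version='1.0' encoding='UTF-8'?><root><item/><item/></root>")
print(count_elements(demo))

# With a real request (placeholder URL), pass the response itself:
#
#   import urllib.request
#   with urllib.request.urlopen("http://localhost:8000/feed.xml") as response:
#       count_elements(response)       # the HTTPResponse object
#       # count_elements(response.fp)  # what the broken code effectively did
```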
