1

I have obtained an XML file, but it fails to load because it contains invalid characters. I have no experience with XML; is there a way to parse the data (perhaps with regex) to show the correct values? Or is the data corrupted?

Here is the output in the XML:

<Name>&#x0;&#x0;&#x0;&#x0;&#x0;+&#xB;&#x1;&#x4;?&#x2;?&#x0;&#x0;&#x0;&#x0;&#x0;??A~?&#x0;G~?&#x4;&#x0;&#x0;??&#x12;</Name>

The error that is being thrown is:

XML Parsing Error: reference to invalid character number

comp32
  • If someone sends you something that purports to be XML but isn't, the best thing to do is send it back to the originator for replacement, just as you would with any other faulty goods. If you want to try and repair it yourself, that's certainly feasible (just don't try to use XML tools, they won't help), but unless you have some idea what these garbage characters actually mean, it's hard to see how your repair can produce anything meaningful. – Michael Kay Feb 20 '18 at 09:43

2 Answers

2

All of the numeric character references (&#x0; etc.) in your "XML" are outside the range of characters allowed by the XML specification, so your data isn't really XML -- it is not well-formed.

No, regex won't help.

Yes, it appears that your data is in some way wrong or corrupt, as a name wouldn't typically consist of null and control characters, even if XML did allow such characters.
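To see which references are the problem, the check can be sketched in Python. This is only an illustration, not part of the original answer: it assumes the XML 1.0 `Char` production (tab, line feed, carriage return, and the ranges from #x20 upward, excluding surrogates and #xFFFE/#xFFFF), and the helper names `is_xml10_char` and `invalid_char_refs` are made up for this example.

```python
import re


def is_xml10_char(cp):
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Matches hexadecimal (&#x12;) and decimal (&#18;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def invalid_char_refs(xml_text):
    """List every character reference whose code point XML 1.0 forbids."""
    bad = []
    for m in CHAR_REF.finditer(xml_text):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if not is_xml10_char(cp):
            bad.append(m.group(0))
    return bad


# Hypothetical shortened sample; &#x41; ("A") is the only valid reference.
sample = "<Name>&#x0;+&#xB;&#x41;&#x12;</Name>"
print(invalid_char_refs(sample))
# ['&#x0;', '&#xB;', '&#x12;']
```

Running this over the `<Name>` element from the question would flag every reference in it, which is why no conforming XML parser will accept the file.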

kjhughes
2

Actually, you could also try either the html.unescape function or replacing &#<something>; with [#something;] (or similar). The first method yields at most one character per bad reference, but it is lossy: &#x0; becomes the replacement character U+FFFD and the other control-character references are dropped, so different inputs can produce the same output. The second method yields a sequence of characters for every bad reference, but it lets you see what the original input looked like.

Examples

import html
import re
from xml.etree import ElementTree as ET

s = "<Name>&#x0;&#x0;&#x0;&#x0;&#x0;+&#xB;&#x1;&#x4;?&#x2;?&#x0;&#x0;&#x0;&#x0;&#x0;??A~?&#x0;G~?&#x4;&#x0;&#x0;??&#x12;</Name>"

ET.fromstring(html.unescape(s)).text
# Out: '�����+??�����??A~?�G~?��??'

# Replace &#anything123; with [#anything123;]
ET.fromstring(re.sub(r'&#([a-zA-Z0-9]+);?', r'[#\1;]', s)).text
# Out: '[#x0;][#x0;][#x0;][#x0;][#x0;]+[#xB;][#x1;][#x4;]?[#x2;]?[#x0;][#x0;][#x0;][#x0;][#x0;]??A~?[#x0;]G~?[#x4;][#x0;][#x0;]??[#x12;]'
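The regex above rewrites every numeric reference, including valid ones such as &#x41;. A possible refinement, sketched here under the assumption that only references outside the XML 1.0 `Char` range should be bracketed, is to pass a function to `re.sub`. The names `bracket_invalid_refs` and `_is_xml10_char` are hypothetical helpers, not part of any library.

```python
import re
from xml.etree import ElementTree as ET


def _is_xml10_char(cp):
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Matches hexadecimal (&#x12;) and decimal (&#18;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def bracket_invalid_refs(xml_text):
    """Rewrite invalid references as [#...;] and leave valid ones untouched."""
    def repl(m):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if _is_xml10_char(cp):
            return m.group(0)  # valid: let the XML parser decode it
        return "[#" + m.group(0)[2:-1] + ";]"  # e.g. &#x0; -> [#x0;]
    return _CHAR_REF.sub(repl, xml_text)


# Hypothetical shortened sample mixing invalid and valid references.
s2 = "<Name>&#x0;A&#x42;&#x12;</Name>"
print(ET.fromstring(bracket_invalid_refs(s2)).text)
# [#x0;]AB[#x12;]
```

This only makes the file parseable for inspection; it does not recover whatever the name was meant to be.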
Niko Föhr