I'm trying to convert some XML into JSON strings via xmltodict. The XML repeats a certain set of data, and I want to pull out each of these repeated nodes and turn it into a JSON string, across all the XML files. I am not generating this XML myself; I download it from a third party and then process it. This is my code:
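import json
import os

import xmltodict

# download_path is set earlier (the folder holding the downloaded XML files)
my_list = []
for file in os.listdir(download_path):
    if file.endswith('.xml'):
        with open(os.path.join(download_path, file), encoding='utf-8') as xml:
            print(file)  # show which file is being parsed
            things = xmltodict.parse(xml.read())
            # each repeated <thing> node becomes its own JSON string
            for thing in things['things']['thing']:
                my_list.append(json.dumps(thing))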
I'm running into ExpatError: not well-formed (invalid token):
So I investigated the XML files using Notepad++, and the problem does not seem to be the usual culprits (&, <, >, etc.) but control characters.
For instance, Notepad++ shows a block of STX BEL BS at the position where the error is reported. I had never encountered these before, but after some searching I found out what they are and that they are bad news for XML.
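To check whether other files have the same problem, something like the following could list where these characters occur. This is only a rough sketch: the function name find_control_chars is made up for illustration, and the character range assumes the offenders are the C0 control characters that XML 1.0 does not allow (everything below U+0020 except tab, line feed and carriage return).

```python
import re

# Assumption: the problem characters are C0 controls other than
# tab (\x09), line feed (\x0a) and carriage return (\x0d).
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def find_control_chars(text):
    """Return a list of (line, column, codepoint) for each control character.

    Lines and columns are 1-based. The text is split on '\n' only, because
    str.splitlines() would also split on some of the characters being hunted.
    """
    hits = []
    for line_no, line in enumerate(text.split('\n'), start=1):
        for match in _CONTROL_CHARS.finditer(line):
            codepoint = f'U+{ord(match.group()):04X}'
            hits.append((line_no, match.start() + 1, codepoint))
    return hits
```

STX, BEL and BS would show up here as U+0002, U+0007 and U+0008.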
So now the question is: how do I get rid of them or work around them? I'd like to build something into the above code that either checks the XML for these characters and fixes it before proceeding, or uses try and except to deal with the error when it comes up. Alternatively, could you point me towards some code that I can run on the XML files to fix them before running them through the process above? (I think more than one file might have this issue.)
I haven't been able to find any solution yet that would allow me to fix the XML but keep it in a form I could still use with xmltodict to eventually get some parsed data I can then pass to JSON.
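To make it concrete, this is roughly the shape of thing I am imagining. It is an untested idea, not something I know to be the right approach: the names strip_invalid_xml_chars and parse_lenient are hypothetical, and it assumes that simply deleting the offending characters is acceptable for my data. The parser is passed in as an argument so the same helper would work with xmltodict.parse.

```python
import re
from xml.parsers.expat import ExpatError

# Assumption: the only problem characters are the ones XML 1.0 forbids,
# i.e. C0 controls other than tab/LF/CR, plus U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def strip_invalid_xml_chars(text):
    """Return text with the characters XML 1.0 disallows removed."""
    return _INVALID_XML_CHARS.sub('', text)


def parse_lenient(text, parse):
    """Try parse(text) as-is; on ExpatError, clean the text and retry once.

    `parse` is any callable that takes an XML string, e.g. xmltodict.parse.
    If the cleaned text still fails, the second ExpatError propagates.
    """
    try:
        return parse(text)
    except ExpatError:
        return parse(strip_invalid_xml_chars(text))
```

In my loop this would replace the direct call with parse_lenient(xml.read(), xmltodict.parse), so well-formed files are parsed untouched and only the broken ones get cleaned.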