Stripping control characters from XML before processing

Question

I'm working to get some XML into JSON strings via xmltodict. Basically the XML repeats a certain set of data and I want to pull out each of these individual repeated nodes and make it a JSON string across all the XML files. I am not generating this XML, but downloading it from a third party then processing it. This is my simple code.

import json
import os

import xmltodict

# download_path is the folder holding the downloaded XML files (defined elsewhere)
my_list = []
for file in os.listdir(download_path):
    if file.endswith('.xml'):
        with open(os.path.join(download_path, file), encoding='utf-8') as xml:
            print(file)
            things = xmltodict.parse(xml.read())
            for thing in things['things']['thing']:
                my_list.append(json.dumps(thing))

I'm running into ExpatError: not well-formed (invalid token):

So I investigated the XML files using Notepad++ and the problem seems to not be the usual culprits (&, <, >, etc) but instead it is control characters.

For instance, in Notepad++ I'm getting a block of STX BEL BS where it says the error is. I've never encountered these before so after some searching I came across what they were and that they are bad news for XML.

So now the question is, how do I get rid of them or work around them? I'd like to build something into the above code that either checks the XML for these and fixes it before proceeding, or uses `try`/`except` to address it when it comes up. Alternatively, could you point me towards some code that I can run on the XML files to fix them before running them through the process above? (I think more than one file might have this issue.)

I haven't been able to find any solution yet that would allow me to fix the XML but keep it in a form I could still use with xmltodict to eventually get some parsed data I can then pass to JSON.

Relevant [why-is-elementtree-raising-a-parseerror](https://stackoverflow.com/questions/7693515/why-is-elementtree-raising-a-parseerror) — stovfl, Apr 12 '19 at 20:49
Have you tried just filtering those characters out of your XML? — CryptoFool, Apr 13 '19 at 00:47
I'd want to know why those characters are there. Why do you think you're getting invalid XML? Could it be that they are being added during a transmission process somehow? — CryptoFool, Apr 13 '19 at 00:53
@Steve I am pretty new to all this so "just filtering those characters out", while sounding simple, is something I am not quite sure how to do while continuing to fit in my workflow. As far as why they are there, my guess would be that this XML is generated from a form. They are occurring in a free text section. I imagine they are artefacts from copy-paste and the code that generates the XML isn't sanitising the inputs. — ndevito1, Apr 13 '19 at 13:37
@ndevito1, you are reading the whole document in as a string via `xml.read()`. You could create a new string by filtering the invalid characters out of the input string. There are a number of ways you could do this. @Cloudomation has an answer that shows you how to do this assuming all characters 32 and below are bad. I don't think this is right, as this will take out newlines and tabs as well, which I don't think you want. — CryptoFool, Apr 13 '19 at 15:31
@ndevito1, see the answer I added for my opinion of how you should do this. — CryptoFool, Apr 13 '19 at 15:51

CryptoFool · Accepted Answer · 2019-04-13T16:00:28.553

Here is an answer that builds on the existing one, but does not presume to know which characters are "printable" and which are not. It leaves that to Python's standard library to determine:

import string
# data holds the XML text read from the file
nonprintable = set([chr(i) for i in range(128)]).difference(string.printable)
filtered_str = "".join([b for b in data if b not in nonprintable])
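One caveat with the filter above: `string.printable` includes vertical tab (`\x0b`) and form feed (`\x0c`), which XML 1.0 does not allow, so those two would still get through. A stricter option is to drop everything outside the XML 1.0 `Char` production. The following is only a sketch of that idea, not part of the original answer; the names `INVALID_XML_CHARS` and `strip_invalid_xml_chars` are made up for illustration.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with every character that XML 1.0 forbids removed."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical sample: STX, BEL, BS and a vertical tab inside free text.
sample = "<thing>caf\u00e9\x02\x07\x08\x0b\tok\n</thing>"
cleaned = strip_invalid_xml_chars(sample)
print(repr(cleaned))
```

Tabs, newlines and non-ASCII text such as the `é` survive, while the control characters are removed.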

So your updated code that incorporates this would be as follows:

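import json
import os
import string

import xmltodict

# ASCII characters that Python does not consider printable (STX, BEL, BS, ...)
nonprintable = set([chr(i) for i in range(128)]).difference(string.printable)

# download_path is the folder holding the downloaded XML files (defined elsewhere)
my_list = []
for file in os.listdir(download_path):
    if file.endswith('.xml'):
        with open(os.path.join(download_path, file), encoding='utf-8') as xml:
            print(file)
            # Drop the control characters before handing the text to the parser
            filtered_xml = "".join([b for b in xml.read() if b not in nonprintable])
            things = xmltodict.parse(filtered_xml)
            for thing in things['things']['thing']:
                my_list.append(json.dumps(thing))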

If you are talking about large XML files, you could probably do this a bit more efficiently to avoid the extra copy of the file that comes from creating an array of characters and then turning that back into a string. I wouldn't worry about this unless you actually notice a delay, or run into a memory problem. I don't think you will. If memory becomes an issue, you'd be best off doing this transformation as you read the file rather than first reading the whole file into memory.
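If it ever does become a problem, one way to transform the data as it is read could look roughly like this. It is a sketch under assumptions: the helper name `clean_xml_file`, the chunk size and the idea of writing a cleaned copy to disk are all illustrative choices, not something from the question.

```python
import os
import string
import tempfile

# Same set as above: ASCII characters that string.printable lacks.
nonprintable = set(chr(i) for i in range(128)).difference(string.printable)
# str.translate deletes every character whose code point maps to None.
DELETE_TABLE = {ord(c): None for c in nonprintable}


def clean_xml_file(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path in chunks, dropping nonprintable characters."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_size)  # at most chunk_size characters
            if not chunk:
                break
            dst.write(chunk.translate(DELETE_TABLE))


# Small demonstration with a temporary file and a deliberately tiny chunk size.
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "bad.xml")
    good = os.path.join(tmp, "good.xml")
    with open(bad, "w", encoding="utf-8") as f:
        f.write("<things>\x02\x07\x08<thing>a</thing></things>\n")
    clean_xml_file(bad, good, chunk_size=8)
    with open(good, encoding="utf-8") as f:
        result = f.read()
print(result)
```

This keeps only one chunk in memory while cleaning. Parsing the cleaned file afterwards still needs whatever memory the parser itself uses.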

Cloudomation · Answer 2 · 2019-04-14T05:25:52.903

You can filter out non-printable characters from the string:

import string


with open('bad.xml', 'r') as f:
    data = f.read()

print('Original')
for c in data:
    print(ord(c), c if c in string.printable else '')

filtered_data = ''.join(c for c in data if c in string.printable)

print('Filtered')
for c in filtered_data:
    print(ord(c), c if c in string.printable else '')

Output:

Original
2 
7 
8 
60 <
120 x
109 m
108 l
62 >
10 

60 <
47 /
120 x
109 m
108 l
62 >
10 

Filtered
60 <
120 x
109 m
108 l
62 >
10 

60 <
47 /
120 x
109 m
108 l
62 >
10

If you do not want to filter out all non-printable characters but only specific ones you can use:

# compare code points: a str character never equals an int, so use ord(c)
filtered_data = ''.join(c for c in data if ord(c) not in (0x2, 0x7, 0x8))
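
Another way to remove a fixed set of characters is `str.translate`, which deletes any character whose code point maps to `None`. A minimal sketch, with `UNWANTED` as a made-up name and the sample string standing in for the file contents:

```python
# Map the code points of STX, BEL and BS to None so translate() deletes them.
UNWANTED = {0x02: None, 0x07: None, 0x08: None}

data = "\x02\x07\x08<xml>\n</xml>\n"
filtered_data = data.translate(UNWANTED)
print(repr(filtered_data))
```

Everything else, including newlines and any non-ASCII text, is left untouched.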

In your code, that could look like this:

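import json
import os
import string

import xmltodict

# download_path is the folder holding the downloaded XML files (defined elsewhere)
my_list = []
for file in os.listdir(download_path):
    if file.endswith('.xml'):
        with open(os.path.join(download_path, file), 'r', encoding='utf-8') as xml:
            data = xml.read()
            # string.printable is ASCII-only, so this also drops non-ASCII
            # characters (accented letters, for example) along with the control characters
            filtered = ''.join(c for c in data if c in string.printable)
            print(file)
            things = xmltodict.parse(filtered)
            for thing in things['things']['thing']:
                my_list.append(json.dumps(thing))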
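
The question also mentions handling the problem with `try`/`except` only when it comes up. A rough sketch of that idea follows. The helper `parse_with_fallback` is hypothetical, and `check_well_formed` is just a standard-library stand-in so the example runs without xmltodict; it raises the same `ExpatError` that the question reports.

```python
import string
from xml.parsers.expat import ExpatError, ParserCreate

# ASCII characters that string.printable does not contain (STX, BEL, BS, ...).
nonprintable = set(chr(i) for i in range(128)).difference(string.printable)


def parse_with_fallback(text, parse):
    """Call parse(text); if expat rejects it, strip control characters and retry.

    Returns a (result, was_cleaned) tuple.
    """
    try:
        return parse(text), False
    except ExpatError:
        cleaned = "".join(c for c in text if c not in nonprintable)
        # A second ExpatError here propagates to the caller.
        return parse(cleaned), True


def check_well_formed(text):
    """Stand-in parser for the demo: raises ExpatError on malformed XML."""
    ParserCreate().Parse(text, True)
    return "ok"


print(parse_with_fallback("<things>\x02\x07\x08</things>", check_well_formed))
print(parse_with_fallback("<things>fine</things>", check_well_formed))
```

In the loop above this would presumably be called as `parse_with_fallback(data, xmltodict.parse)`. The `was_cleaned` flag makes it easy to log which files needed fixing.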

edited Apr 14 '19 at 05:25

answered Apr 13 '19 at 14:28

Cloudomation


This will take out newlines, returns and tabs along with the "bad characters". I think this is too brute force. What is the decode thing for? – CryptoFool Apr 13 '19 at 15:36
@Steve you are right, it filtered too much. I updated my answer. [`decode()`](https://docs.python.org/3/library/stdtypes.html#bytes.decode) converts a `bytes` object into a Unicode string (Python 3) – Cloudomation Apr 14 '19 at 04:43
But can't we just stay in the character domain? Why the steps to go to bytes and then back to characters, especially when you seem to be doing the real work in characters, and then converting to bytes just to convert back? I'm asking because I haven't done a lot of thinking about Unicode, so I wouldn't be surprised if there's something to what you're doing. I just don't see it (yet). – CryptoFool Apr 14 '19 at 04:52
@Steve you are right, it also works when dealing with `str` and `chr` only. No need for `bytes`. I updated my answer – Cloudomation Apr 14 '19 at 05:27
