I have a set of 100K XML-ish (more on that later) legacy files with a consistent structure: an Archive wrapper around multiple Date and Data pair records.
I need to extract the individual records and write them to individual text files, but I'm having trouble parsing the data because of illegal characters and random leading/trailing CRs, spaces, and tabs.
About the XML Files
The files are inherited from a retired system and can't be regenerated. Each file is pretty small (less than 5 MB).
There is one Date record for every Data record:
vendor-1-records.xml
<Archive>
<Date>10 Jan 2019</Date>
<Data>Vendor 1 Record 1</Data>
<Date>12 Jan 2019</Date>
<Data>Vendor 1 Record 2</Data>
(etc)
</Archive>
vendor-2-records.xml
<Archive>
<Date>22 September 2019</Date>
<Data>Vendor 2 Record 1</Data>
<Date>24 September 2019</Date>
<Data>Vendor 2 Record 2</Data>
(etc)
</Archive>
...
vendor-100000-records.xml
<Archive>
<Date>12 April 2019</Date>
<Data>Vendor 100000 Record 1</Data>
<Date>24 October 2019</Date>
<Data>Vendor 100000 Record 2</Data>
(etc)
</Archive>
I would like to extract each Data record, use the Date entry to build a unique file name, and write the contents of the Data record to that file, like so:
filename: vendor-1-record-1-2019-1Jan-10.txt contains
file contents: Vendor 1 record 1
(no tags, just the record terminated by CR)
filename: vendor-1-record-2-2019-1Jan-12.txt contains
file contents: Vendor 1 record 2
filename: vendor-2-record-1-2019-9Sep-22.txt contains
file contents: Vendor 2 record 1
filename: vendor-2-record-2-2019-9Sep-24.txt contains
file contents: Vendor 2 record 2
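For the date portion of the file name, here is a minimal sketch of turning a Date value into the year-monthnumber-monthabbrev-day fragment shown above. It assumes the dates are always day/month/year and that both abbreviated ("Jan") and full ("September") month names occur, as in the samples:

```python
from datetime import datetime

def date_slug(date_text):
    """Turn '10 Jan 2019' or '22 September 2019' into '2019-1Jan-10' / '2019-9Sep-22'."""
    s = date_text.strip()
    for fmt in ("%d %b %Y", "%d %B %Y"):  # abbreviated, then full month names
        try:
            dt = datetime.strptime(s, fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognized date: {date_text!r}")
    # year, month number + 3-letter month, day -- matching the target file names
    return f"{dt.year}-{dt.month}{dt.strftime('%b')}-{dt.day}"
```

A record counter per vendor file would supply the rest of the name, e.g. `f"vendor-1-record-1-{date_slug('10 Jan 2019')}.txt"`.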
Issue 1: illegal characters in XML Data records
One issue is that the elements contain characters that XML libraries such as ElementTree choke on, including control characters, formatting characters, and various Alt+XXX-style characters.
I've searched online and found all manner of workarounds, regexes, and search-and-replace scripts, but the only thing that seems to work in Python is lxml's etree with recover=True.
However, even that doesn't always work, because some of the files are apparently not UTF-8, so I get the error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
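One way around that error is to decode the raw bytes myself before handing anything to the parser. A minimal sketch; the cp1252 fallback is an assumption about what the legacy system produced, and latin-1 maps every byte value, so the function never raises:

```python
def decode_lenient(raw: bytes) -> str:
    """Try UTF-8 first, then a common single-byte encoding; never raises."""
    for enc in ("utf-8", "cp1252"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # accepts any byte sequence, so this always succeeds
```

The result can then be re-encoded as clean UTF-8 bytes for lxml, e.g. `etree.fromstring(decode_lenient(open(path, 'rb').read()).encode('utf-8'), parser)`.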
Issue 2: Data records have random amounts of leading and trailing CRs and spaces
For the files I can parse with lxml.etree, the actual Data records are also wrapped in CRs and random spaces:
<Data>
(random numbers of CR + spaces and sometimes tabs)
*content<CR>*
(random numbers of CR + spaces and sometimes tabs)
</Data>
and therefore when I run
parser = etree.XMLParser(recover=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)
tags_needed = tree.iter('Data')
for it in tags_needed:
    print(it.tag, it.attrib)
I get a collection of apparently empty Data tags (one for each Data record in the file) like
Data {}
Data {}
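Note that those tags may not actually be empty: it.attrib is the dict of XML attributes, which these Data elements simply don't have, while the element content (including all the surrounding whitespace) lives in it.text. A sketch using the stdlib ElementTree on a small well-formed sample (the real files would still need lxml's recover=True, but .text behaves the same there):

```python
import xml.etree.ElementTree as ET

sample = ("<Archive><Date>10 Jan 2019</Date>"
          "<Data>\n   \t\n  Vendor 1 Record 1\n  </Data></Archive>")
root = ET.fromstring(sample)
for el in root.iter("Data"):
    print(el.attrib)              # {} -- no attributes, as in the output above
    print(repr(el.text))          # the content, wrapped in CR/space/tab noise
    print(repr(el.text.strip()))  # 'Vendor 1 Record 1'
```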
Questions
- Is there a more efficient language/module than Python's lxml for ignoring the illegal characters? As I said, I've dug through a number of cookbook blog posts, SE articles, etc. on pre-processing the XML, and nothing really works; there's always one more control character that hangs the parser.
SE suggested a post about cleaning XML that references an old Atlassian tool (Stripping Invalid XML characters in Java). I ran some basic tests and it seems like it might work, but I'm open to other suggestions.
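For reference, the kind of stripping that tool does can be sketched in a few lines of Python: delete everything outside the XML 1.0 valid-character ranges before parsing. The character class below is the Char production from the XML 1.0 spec (tab, LF, CR, and the listed Unicode ranges); whether it catches every offender in these particular files is untested:

```python
import re

# Characters NOT valid in XML 1.0 (inverse of the spec's Char production).
_INVALID_XML = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that are illegal in an XML 1.0 document."""
    return _INVALID_XML.sub("", text)
```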
- I haven't used regex with Python much. Any suggestions on how to clean up the random leading/trailing CRs, spaces, and tabs in the Data tags? The record string I actually want also ends with a CR and may contain tabs, so I can't just strip all whitespace everywhere. Maybe there's a regex way to pull that out, but my regex-fu is pretty weak.
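Since the junk is purely leading and trailing, a sketch of the cleanup step: str.strip() already handles it without regex, and the regex equivalent anchors on the start and end of the string so internal tabs and CRs are untouched. The trailing CR is re-added explicitly to match the target format described above (these names are illustrative, not from any particular library):

```python
import re

def clean_record(text: str) -> str:
    """Strip only the outer whitespace noise; keep one terminating newline."""
    # \A / \Z anchor at the very start/end of the string, so interior
    # tabs and CRs inside the record survive. Equivalent to text.strip().
    body = re.sub(r"\A\s+|\s+\Z", "", text)
    return body + "\n"

raw = "\n   \t\n  Vendor 1 Record 1\n  \t "
# clean_record(raw) -> 'Vendor 1 Record 1\n'
```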