I have a set of 100K XML-ish (more on that later) legacy files with a consistent structure: an Archive wrapper around multiple Date and Data pair records.
I need to extract the individual records and write them to individual text files, but I'm having trouble parsing the data because of illegal characters and random leading/trailing CRs, spaces, and tabs.
About the XML Files
The files are inherited from a retired system and can't be regenerated. Each file is pretty small (less than 5 MB).
There is one Date record for every Data record:
vendor-1-records.xml
<Archive>
<Date>10 Jan 2019</Date>
<Data>Vendor 1 Record 1</Data>
<Date>12 Jan 2019</Date>
<Data>Vendor 1 Record 2</Data>
(etc)
</Archive>
vendor-2-records.xml
<Archive>
<Date>22 September 2019</Date>
<Data>Vendor 2 Record 1</Data>
<Date>24 September 2019</Date>
<Data>Vendor 2 Record 2</Data>
(etc)
</Archive>
...
vendor-100000-records.xml
<Archive>
<Date>12 April 2019</Date>
<Data>Vendor 100000 Record 1</Data>
<Date>24 October 2019</Date>
<Data>Vendor 100000 Record 2</Data>
(etc)
</Archive>
I would like to extract each Data record, use the Date entry to build a unique file name, and write the contents of the Data record to that file, like so:
filename: vendor-1-record-1-2019-1Jan-10.txt contains
file contents: Vendor 1 record 1
(no tags, just the record terminated by CR)
filename: vendor-1-record-2-2019-1Jan-12.txt contains
file contents: Vendor 1 record 2
filename: vendor-2-record-1-2019-9Sep-22.txt contains
file contents: Vendor 2 record 1
filename: vendor-2-record-2-2019-9Sep-24.txt contains
file contents: Vendor 2 record 2
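For the date portion of the file name, here is a minimal sketch of turning a Date value into the year-monthnumber-monthabbrev-day fragment shown above. It assumes the dates are always day/month/year and that both abbreviated ("Jan") and full ("September") month names occur, as in the samples:

```python
from datetime import datetime

def date_slug(date_text):
    """Turn '10 Jan 2019' or '22 September 2019' into '2019-1Jan-10' / '2019-9Sep-22'."""
    s = date_text.strip()
    for fmt in ("%d %b %Y", "%d %B %Y"):  # abbreviated, then full month names
        try:
            dt = datetime.strptime(s, fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognized date: {date_text!r}")
    # year, month number + 3-letter month, day -- matching the target file names
    return f"{dt.year}-{dt.month}{dt.strftime('%b')}-{dt.day}"
```

A record counter per vendor file would supply the rest of the name, e.g. `f"vendor-1-record-1-{date_slug('10 Jan 2019')}.txt"`.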
Issue 1: illegal characters in XML Data records
One issue is that the elements contain characters that XML libraries such as ElementTree choke on, including control characters, formatting characters, and various Alt+XXX-style characters.
I've searched online and found all manner of workarounds, regexes, and search-and-replace scripts, but the only thing that seems to work in Python is lxml's etree with recover=True.
However, even that doesn't always work, because some of the files are apparently not UTF-8, so I get the error:
lxml.etree.XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
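One way around that error is to decode the raw bytes myself before handing anything to the parser. A minimal sketch; the cp1252 fallback is an assumption about what the legacy system produced, and latin-1 maps every byte value, so the function never raises:

```python
def decode_lenient(raw: bytes) -> str:
    """Try UTF-8 first, then a common single-byte encoding; never raises."""
    for enc in ("utf-8", "cp1252"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # accepts any byte sequence, so this always succeeds
```

The result can then be re-encoded as clean UTF-8 bytes for lxml, e.g. `etree.fromstring(decode_lenient(open(path, 'rb').read()).encode('utf-8'), parser)`.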
Issue 2: Data records have random amounts of leading and trailing CRs and spaces
For the files I can parse with lxml.etree, the actual Data records are also wrapped in CRs and random spaces:
<Data>
(random numbers of CR + spaces and sometimes tabs)
*content<CR>*
(random numbers of CR + spaces and sometimes tabs)
</Data>
and therefore when I run
parser = etree.XMLParser(recover=True)
tree = etree.parse('vendor-1-records.xml', parser=parser)
tags_needed = tree.iter('Data')
for it in tags_needed:
    print(it.tag, it.attrib)
I get a collection of apparently empty Data tags (one for each Data record in the file) like
Data {}
Data {}
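Note that those tags may not actually be empty: it.attrib is the dict of XML attributes, which these Data elements simply don't have, while the element content (including all the surrounding whitespace) lives in it.text. A sketch using the stdlib ElementTree on a small well-formed sample (the real files would still need lxml's recover=True, but .text behaves the same there):

```python
import xml.etree.ElementTree as ET

sample = ("<Archive><Date>10 Jan 2019</Date>"
          "<Data>\n   \t\n  Vendor 1 Record 1\n  </Data></Archive>")
root = ET.fromstring(sample)
for el in root.iter("Data"):
    print(el.attrib)              # {} -- no attributes, as in the output above
    print(repr(el.text))          # the content, wrapped in CR/space/tab noise
    print(repr(el.text.strip()))  # 'Vendor 1 Record 1'
```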
Questions
- Is there a more efficient language/module than Python's lxml for ignoring the illegal characters? As I said, I've dug through a number of cookbook blog posts, SE articles, etc. on pre-processing the XML, and nothing really works; there's always one more control character that hangs the parser.
SE suggested a post about cleaning XML that references an old Atlassian tool (Stripping Invalid XML characters in Java). I ran some basic tests and it seems like it might work, but I'm open to other suggestions.
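For reference, the kind of stripping that tool does can be sketched in a few lines of Python: delete everything outside the XML 1.0 valid-character ranges before parsing. The character class below is the Char production from the XML 1.0 spec (tab, LF, CR, and the listed Unicode ranges); whether it catches every offender in these particular files is untested:

```python
import re

# Characters NOT valid in XML 1.0 (inverse of the spec's Char production).
_INVALID_XML = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that are illegal in an XML 1.0 document."""
    return _INVALID_XML.sub("", text)
```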
- I haven't used regex with Python much. Any suggestions on how to clean up the random leading/trailing CRs, spaces, and tabs in the Data tags? The record string I actually want also ends with a CR and may contain tabs, so I can't just strip all whitespace everywhere. Maybe there's a regex way to pull that out, but my regex-fu is pretty weak.
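Since the junk is purely leading and trailing, a sketch of the cleanup step: str.strip() already handles it without regex, and the regex equivalent anchors on the start and end of the string so internal tabs and CRs are untouched. The trailing CR is re-added explicitly to match the target format described above (these names are illustrative, not from any particular library):

```python
import re

def clean_record(text: str) -> str:
    """Strip only the outer whitespace noise; keep one terminating newline."""
    # \A / \Z anchor at the very start/end of the string, so interior
    # tabs and CRs inside the record survive. Equivalent to text.strip().
    body = re.sub(r"\A\s+|\s+\Z", "", text)
    return body + "\n"

raw = "\n   \t\n  Vendor 1 Record 1\n  \t "
# clean_record(raw) -> 'Vendor 1 Record 1\n'
```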