I am trying to use Python to clean up some messy XML files. The cleanup does three things:
- Convert 40%-50% of the tag names from upper case to lower case
- Remove NULL values between tags
- Remove empty rows between tags
I first did this with BeautifulSoup; however, I ran into memory issues since some of my XML files are over 1 GB. I then looked into streaming approaches such as xml.sax, but I did not quite get the approach. Can anyone give me some suggestions?
xml_str = """
<DATA>
<ROW>
<assmtid>1</assmtid>
<Year>1988</Year>
</ROW>
<ROW>
<assmtid>2</assmtid>
<Year>NULL</Year>
</ROW>
<ROW>
<assmtid>2</assmtid>
<Year>1990</Year>
</ROW>
</DATA>
"""
xml_str_update = re.sub(r">NULL", ">", xml_str)
soup = BeautifulSoup(xml_str_update, "lxml")
print soup.data.prettify().encode('utf-8').strip()
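For reference, here is a rough sketch of the xml.sax idea mentioned above (Python 3; `CleanupHandler` is a name I made up, and attributes are ignored since the sample has none): a `ContentHandler` rewrites each event as it streams past, lowercasing tag names and dropping NULL/whitespace-only text, so the whole file never sits in memory.

```python
# Sketch only: stream SAX events, lowercase tags, skip NULL and
# whitespace-only text runs. Attributes are not handled.
import io
import xml.sax
from xml.sax.saxutils import escape

class CleanupHandler(xml.sax.ContentHandler):
    def __init__(self, out):
        super().__init__()
        self.out = out  # any file-like object opened for text

    def startElement(self, name, attrs):
        self.out.write('<%s>' % name.lower())

    def endElement(self, name):
        self.out.write('</%s>' % name.lower())

    def characters(self, content):
        # Drop NULL placeholders and the blank runs between tags;
        # note this also discards pretty-printing whitespace.
        text = content.strip()
        if text and text != "NULL":
            self.out.write(escape(content))

out = io.StringIO()
xml.sax.parseString(b"<DATA><ROW><Year>NULL</Year></ROW></DATA>",
                    CleanupHandler(out))
```

Because the handler writes as it reads, memory use stays flat regardless of file size; the trade-off is that all inter-tag whitespace is discarded, so the output is a single line unless you re-indent it.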
Update
After some testing and taking suggestions from Jarrod Roberson, below is one possible solution.
import os
import xml.etree.cElementTree as etree
from cStringIO import StringIO

def getelements(xml_str):
    context = iter(etree.iterparse(StringIO(xml_str), events=('start', 'end')))
    event, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == "ROW":
            elem.tag = elem.tag.lower()
            elem.text = "\n\t\t"
            elem.tail = "\n\t"
            for child in elem:
                child.tag = child.tag.lower()
                if child.text == "NULL":
                    # if you do not like self-closing tags,
                    # set the text to &#8203;, a zero-width space
                    child.text = ""
                if child.text is None:
                    child.text = ""
                # print event, elem.tag
            yield elem
            root.clear()

with open(pth_to_output_xml, 'wb') as file:
    # start root
    file.write('<data>\n\t')
    for page in getelements(xml_str):
        file.write(etree.tostring(page, encoding='utf-8'))
    # close root
    file.write('</data>')
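In case it helps anyone on Python 3, the same iterparse pattern ports over with `io.StringIO` and plain `xml.etree.ElementTree` (`cStringIO` and the `cElementTree` accelerator module are Python 2-only). A minimal sketch, with the pretty-printing lines left out:

```python
# Python 3 port of the iterparse approach: stream <ROW> elements,
# lowercase tags, blank out NULL text, and clear processed nodes
# from the root so memory use stays bounded.
import io
import xml.etree.ElementTree as etree

def getelements(fileobj):
    context = iter(etree.iterparse(fileobj, events=('start', 'end')))
    event, root = next(context)  # grab the root element first
    for event, elem in context:
        if event == 'end' and elem.tag == 'ROW':
            elem.tag = elem.tag.lower()
            for child in elem:
                child.tag = child.tag.lower()
                if child.text == 'NULL' or child.text is None:
                    child.text = ''
            yield elem
            root.clear()  # free rows that have already been yielded

sample = "<DATA><ROW><assmtid>1</assmtid><Year>NULL</Year></ROW></DATA>"
rows = list(getelements(io.StringIO(sample)))
```

For a real 1 GB file you would pass an open file handle instead of a `StringIO`, and write each yielded element out immediately rather than collecting them in a list.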