
I am trying to use Python to clean up some messy XML files. The cleanup needs to do three things:

  1. Convert 40%-50% of the tag names from upper case to lower case
  2. Remove NULL text between tags
  3. Remove empty rows between tags

I did this using BeautifulSoup; however, I ran into memory issues since some of my XML files are over 1GB. Instead, I looked into streaming methods like xml.sax, but I did not quite get the approach. Can anyone give me some suggestions?
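For reference, the rough (untested) xml.sax sketch I had in mind was something like the one below, where in.xml and out.xml are placeholder file names; I was not sure this was the right direction:

import xml.sax
from xml.sax.saxutils import XMLGenerator

class CleanupHandler(xml.sax.ContentHandler):
    # Echoes the event stream back out with lowercased tag names,
    # dropping NULL text and whitespace-only runs between tags.
    # Note: SAX may split one text run across several characters()
    # calls, so the NULL check here is only a sketch.
    def __init__(self, out):
        xml.sax.ContentHandler.__init__(self)
        self._gen = XMLGenerator(out, encoding='utf-8')

    def startElement(self, name, attrs):
        self._gen.startElement(name.lower(), attrs)

    def characters(self, content):
        if content.strip() and content.strip() != 'NULL':
            self._gen.characters(content)

    def endElement(self, name):
        self._gen.endElement(name.lower())

with open('out.xml', 'w') as out:
    xml.sax.parse('in.xml', CleanupHandler(out))

Below is a small sample of the data and my BeautifulSoup attempt: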

import re
from bs4 import BeautifulSoup

xml_str = """
<DATA>

    <ROW>
        <assmtid>1</assmtid>
        <Year>1988</Year>
    </ROW>

    <ROW>
        <assmtid>2</assmtid>
        <Year>NULL</Year>
    </ROW>

    <ROW>
        <assmtid>2</assmtid>
        <Year>1990</Year>
    </ROW>

</DATA>
"""

xml_str_update = re.sub(r">NULL", ">", xml_str)
soup = BeautifulSoup(xml_str_update, "lxml")
print soup.data.prettify().encode('utf-8').strip()

Update

After some testing and taking Jarrod Roberson's suggestions, below is one possible solution.

import xml.etree.cElementTree as etree
from cStringIO import StringIO

def getelements(xml_str):
    context = iter(etree.iterparse(StringIO(xml_str), events=('start', 'end')))
    event, root = next(context)

    for event, elem in context:
        if event == 'end' and elem.tag == "ROW":
            elem.tag = elem.tag.lower()
            elem.text = "\n\t\t"  # indent before the first child
            elem.tail = "\n\t"    # newline after each row

            for child in elem:
                child.tag = child.tag.lower()
                # if you do not like the resulting self-closing tags,
                # use u"\u200b" (a zero-width space) instead of ""
                if child.text == "NULL" or child.text is None:
                    child.text = ""
            yield elem
            root.clear()  # discard rows that have already been yielded

with open(pth_to_output_xml, 'wb') as f:  # pth_to_output_xml is the output path
    # start root
    f.write('<data>\n\t')
    for page in getelements(xml_str):
        f.write(etree.tostring(page, encoding='utf-8'))
    # close root
    f.write('</data>')
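Since root.clear() runs after every yielded row, the tree never holds more than one ROW element at a time, so memory usage stays flat even on the files over 1GB.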
TTT
  • @alecxe - thanks for the suggestion. Does it require explicitly listing the tags whose case needs to be converted? – TTT Nov 14 '15 at 02:51
  • Possible duplicate of [Why is lxml.etree.iterparse() eating up all my memory?](http://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory) – Nov 14 '15 at 03:09

1 Answer


Iterative parsing

When building an in-memory tree is not desired or practical, use an iterative parsing technique that does not rely on reading the entire source file. lxml offers two approaches:

  1. Supplying a target parser class
  2. Using the iterparse method
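For the first approach, a minimal sketch of the target interface is shown below. The class name, chunk size, and in.xml file name are illustrative only; start, end, data, and close are the methods the parser calls instead of building a tree:

import xml.etree.ElementTree as etree

class EchoTarget(object):
    # Receives parser events instead of having a tree built.
    def start(self, tag, attrib):
        print 'start', tag, attrib
    def end(self, tag):
        print 'end', tag
    def data(self, text):
        pass  # character data between tags
    def close(self):
        return 'done'  # becomes the return value of parser.close()

parser = etree.XMLParser(target=EchoTarget())
with open('in.xml', 'rb') as f:
    for chunk in iter(lambda: f.read(64 * 1024), ''):
        parser.feed(chunk)  # feed the file in chunks; no full tree is kept
result = parser.close()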

import xml.etree.ElementTree as etree
# xml_file is the path to your XML file (or an open file object)
for event, elem in etree.iterparse(xml_file, events=('start', 'end', 'start-ns', 'end-ns')):
    print event, elem

Here is a very complete tutorial on how to do this.

This parses the XML file a chunk at a time and hands you the results at every step of the way. start fires when a tag is first encountered; at that point elem is empty except for elem.attrib, which contains the tag's attributes. end fires when the closing tag is encountered, after everything in between has been read.

Then in your event handlers you just write out the transformed information as it is encountered.
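A condensed, untested sketch of that idea (in.xml and out.xml are placeholder names):

import xml.etree.cElementTree as etree

context = iter(etree.iterparse('in.xml', events=('start', 'end')))
event, root = next(context)  # the first start event hands us the root

with open('out.xml', 'wb') as out:
    out.write('<data>')
    for event, elem in context:
        if event == 'end' and elem.tag == 'ROW':
            elem.tag = elem.tag.lower()
            for child in elem:
                child.tag = child.tag.lower()
                if child.text in (None, 'NULL'):
                    child.text = ''
            out.write(etree.tostring(elem, encoding='utf-8'))
            root.clear()  # discard rows that have already been written
    out.write('</data>')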

Community
  • Thanks for the reply. However, in my case, I also need to output the whole XML file. Do you have any suggestions? Here is the latest question: http://stackoverflow.com/questions/33715226/python-cleanup-large-xml-files-using-streaming-method – TTT Nov 15 '15 at 01:12
  • The very last line of my answer tells you what you need to be doing, as does the answer in the duplicate. – Nov 15 '15 at 01:50