I'm trying to remove all the project1 nodes (along with their child elements) from the sample XML document below (the original document is about 30 GB) using a SAX parser. Either a separate modified file or an in-place edit would be fine.

sample.xml

<ROOT>
    <test src="http://dfs.com">Hi</test>
    <project1>This is old data<foo></foo></project1>
    <bar>
        <project1>ty</project1>
        <foo></foo>
    </bar>
</ROOT>

Here is my attempt:

parser.py

from xml.sax.handler import ContentHandler
import xml.sax

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self, out_file):
        self._charBuffer = []
        self._result = []
        self._out = open(out_file, 'w')

    def _createElement(self, name, attrs):
        attributes = attrs.items()
        if attributes:
            out = ''
            for key, value in attributes:
                out += ' {}={}'.format(key, value)
            return '<{}{}>'.format(name, out)
        return '<{}>'.format(name)


    def _getCharacterData(self):
        data = ''.join(self._charBuffer).strip()
        self._charBuffer = []
        self._out.write(data.strip()) #remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if not name == 'project1': 
            self._result.append({})
            self._out.write(self._createElement(name, attrs))

    def endElement(self, name):
        if not name == 'project1': self._result[-1][name] = self._getCharacterData()

MyHandler('out.xml').parse("sample.xml")

I can't get it to work.

Avinash Raj
  • What's a problem to process data as text? Simply: check flag, is it down, grab line, is it project1, raise flag, write/append or not, repeat... Just an outline of strategy –  Feb 19 '17 at 17:58
  • But this approach will result in loading the whole file into memory. – Avinash Raj Feb 20 '17 at 05:57
  • I mean: read line - process line - update state - decide write or not. Don't work with whole file at once. There is no need. –  Feb 20 '17 at 06:21
  • u can even use buffer to reduce write count. For example, flush buffer only every 1000 lines. Measure it by yourself if it's important. –  Feb 20 '17 at 06:24
  • What u r doing now is over complicated. SAX parsing good for some situations, but it's just an abstraction over simply reading xml file line by line and dealing with events (startElement, endElement). Every time the bunch of objects would be created, and then u should grab data and produce new bunch of objects just to write this data to file. –  Feb 20 '17 at 06:44
  • This was the first task given, there are many tasks following up which deals with modifying xml like modifying the attributes of a specific element, etc. So that I thought it would be better if I get a sax based answer. – Avinash Raj Feb 20 '17 at 06:49
  • `xml.etree.ElementTree.iterparse` is easier to use, and allows good control over the objects created by the parser. – cco Feb 20 '17 at 07:31
  • @cco I saw solutions using iterparse which does only the parsing job but I don't find any regarding parsing and writing serially – Avinash Raj Feb 20 '17 at 07:33
  • @ar7max: The problem with processing XML as text is well known -- it leads to brittle solutions that break in a myriad of ways when perfectly reasonable variations in the XML occur. Please do not make such recommendations. Thanks. – kjhughes Feb 25 '17 at 17:26
  • Yesterday I filtered XML using simple text processing - nothing broke. Wanna know why? 1) It's text file 2) I know how tags works 3) Little magic 4) Do you know how parsers works? They r reading TEXT. Now he needs to remove redundant elements from XML. Wanna use SAX(or similar)? Use SAX(or similar) (dont even know why my SAX solution recieved -1 from u, mb its a kind of joke). Do u need to use SAX (or similar)? No. Wanna know why? goto 1 –  Feb 25 '17 at 18:20

1 Answer

You could use an xml.sax.saxutils.XMLFilterBase implementation to filter out your project1 nodes.

Instead of assembling the xml strings yourself you could use xml.sax.saxutils.XMLGenerator.
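As a minimal sketch of why XMLGenerator is preferable to hand-built strings: it quotes attribute values and escapes special characters, which the `_createElement` method in the question does not do. The query string in the URL and the text content here are made-up values chosen only to show the escaping.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Write into an in-memory text buffer instead of a real file
buf = io.StringIO()
gen = XMLGenerator(buf, encoding='utf-8')

gen.startDocument()
# The attribute value and the text both contain characters that need escaping
gen.startElement('test', AttributesImpl({'src': 'http://dfs.com?a=1&b=2'}))
gen.characters('1 < 2')
gen.endElement('test')
gen.endDocument()

generated = buf.getvalue()
print(generated)
```' in generated
assert generated.startswith('<?xml version="1.0" encoding="utf-8"?>')
</test>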

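The following is Python 3 code. If you need Python 2, replace each super() call with an explicit call on the base class, e.g. XMLFilterBase.startElement(self, name, attrs).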

from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class Project1Filter(XMLFilterBase):
    """This decides which SAX events to forward to the ContentHandler

    We will not forward events when we are inside any element with a
    name specified in the 'tag_names_to_exclude' parameter.
    """

    def __init__(self, tag_names_to_exclude, parent=None):
        super().__init__(parent)

        # set of tag names to exclude
        self._tag_names_to_exclude = tag_names_to_exclude

        # _project_1_count keeps track of opened project1 elements
        self._project_1_count = 0

    def _forward_events(self):
        # will return True when we are not inside a project1 element
        return self._project_1_count == 0

    def startElement(self, name, attrs):
        if name in self._tag_names_to_exclude:
            self._project_1_count += 1

        if self._forward_events():
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._forward_events():
            super().endElement(name)

        if name in self._tag_names_to_exclude:
            self._project_1_count -= 1

    def characters(self, content):
        if self._forward_events():
            super().characters(content)

    # override other content handler methods on XMLFilterBase as necessary


def main():
    tag_names_to_exclude = {'project1', 'project2', 'project3'}
    reader = Project1Filter(tag_names_to_exclude, make_parser())

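    # Use one explicit encoding for both the file and the XML declaration;
    # XMLGenerator otherwise declares iso-8859-1 whatever the file's encoding is
    with open('out-small.xml', 'w', encoding='utf-8') as f:
        handler = XMLGenerator(f, encoding='utf-8')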
        reader.setContentHandler(handler)
        reader.parse('input.xml')


if __name__ == "__main__":
    main()
Jeremy Allen