
I'm a Python beginner and need help filtering .xml files. I've been trying xml.etree.ElementTree with little success.

The xml looks like this:

<ClientData>
  <Report>
    <ReportHost>
      <ReportItem pluginID="11111">

        Ipsum lorem etc leviosa!

      </ReportItem>
    </ReportHost>
    <ReportHost>
      <ReportItem pluginID="22222">

        Sed ut perspiciatis unde omnis iste

      </ReportItem>
    </ReportHost>
  </Report>
</ClientData>

If a ReportItem's pluginID attribute matches an item on a blacklist, I would like to remove the entire ReportItem element along with its children, then write the filtered .xml. Thanks!

Edit - Here's what I have so far but I'm not sure how to get it to work with this level of nesting:

from xml.etree.ElementTree import ElementTree

tree = ElementTree()

# Test input
tree.parse("test.xml")

for node in tree.findall('ReportItem'):
    if tag.attrib['pluginID']=='11111':
        tree.remove(node)

tree.write('test_out.xml')
mpace

2 Answers


I strongly suggest using the lxml module. Elements in Python's built-in xml.etree.ElementTree module hold no reference to their parent, so removing a nested element is awkward. lxml elements provide getparent(), which should make this much easier.
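If installing lxml is not an option, the missing parent reference can be worked around in the standard library by visiting every element as a potential parent and removing matching children from it. The following is only a minimal sketch under that assumption: the function name `remove_blacklisted` and the `SAMPLE` string are made up for illustration, and the blacklist is assumed to be a set of pluginID strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE = """<ClientData>
  <Report>
    <ReportHost>
      <ReportItem pluginID="11111">Ipsum lorem etc leviosa!</ReportItem>
    </ReportHost>
    <ReportHost>
      <ReportItem pluginID="22222">Sed ut perspiciatis unde omnis iste</ReportItem>
    </ReportHost>
  </Report>
</ClientData>"""


def remove_blacklisted(root, blacklist):
    """Remove every ReportItem whose pluginID is in `blacklist`.

    Returns the number of elements removed.
    """
    removed = 0
    # Snapshot the elements first so the tree is not modified while iterating.
    for parent in list(root.iter()):
        # Copy the child list as well, because we remove from it inside the loop.
        for child in list(parent):
            if child.tag == "ReportItem" and child.get("pluginID") in blacklist:
                parent.remove(child)
                removed += 1
    return removed
```

With the files from the question, usage would look roughly like `tree = ET.parse("test.xml")`, then `remove_blacklisted(tree.getroot(), {"11111"})`, then `tree.write("test_out.xml")`.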

hamiltoncw

This is what I ended up developing. It runs into memory problems when filtering files of roughly 600MB or more, and possibly smaller ones. From what I've read, there are more memory-efficient approaches than parsing the whole XML document at once, but I haven't had time to test them.

import lxml.etree as le
import os
from optparse import OptionParser

def removeVulns(filename, pluginlist):
    # Load the blacklisted plugin IDs once (one ID per line).
    with open(pluginlist) as ids:
        blacklist = set(line.strip() for line in ids if line.strip())
    doc = le.parse(filename)
    # xpath() returns a list, so removing elements inside the loop is safe.
    for elem in doc.xpath('//*[@pluginID]'):
        if elem.get('pluginID') in blacklist:
            elem.getparent().remove(elem)
    # Write to a temporary file first, then replace the original.
    doc.write('temp.xml')
    os.remove(filename)
    os.rename('temp.xml', filename)


def main():
    parser = OptionParser(usage='%prog -f <filename>', 
                            version='%prog 1.0')   
    parser.add_option('-f',
                      dest='name',
                      type='string',
                      help='.nessus name')


    (options, args) = parser.parse_args()
    if not options.name:
        parser.error('Pop, you forgot name!')
    removeVulns(options.name, 'pluginlist.txt')

if __name__ == "__main__":
    main()
mpace
  • You can remove the `if` check by moving the logic into the XPath expression, like so: `for elem in doc.xpath('//*[@pluginID="{0}"]'.format(nessusID.strip('\n'))):` – har07 Apr 01 '16 at 03:05
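The idea in the comment above, pushing the attribute test into the path expression, can also be sketched with the standard library, whose limited XPath subset supports `[@attrib='value']` predicates and `..` for the parent. This swaps lxml's `xpath()` for `xml.etree.ElementTree.findall()`, so it is not the commenter's exact code. The helper name `remove_by_plugin_id` and the `SAMPLE` string are hypothetical, and the plugin ID is assumed to contain no quote characters, since it is interpolated directly into the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE = """<ClientData>
  <Report>
    <ReportHost>
      <ReportItem pluginID="11111">Ipsum lorem etc leviosa!</ReportItem>
    </ReportHost>
    <ReportHost>
      <ReportItem pluginID="22222">Sed ut perspiciatis unde omnis iste</ReportItem>
    </ReportHost>
  </Report>
</ClientData>"""


def remove_by_plugin_id(root, plugin_id):
    """Remove all ReportItem elements with the given pluginID.

    Returns the number of elements removed.
    """
    # Assumption: plugin_id has no quote characters (e.g. a numeric ID).
    item_path = "ReportItem[@pluginID='{0}']".format(plugin_id)
    removed = 0
    # ".//<item>/.." selects each parent that has a matching ReportItem child.
    for parent in root.findall(".//" + item_path + "/.."):
        # findall() returns a list, so removing inside the loop is safe.
        for item in parent.findall(item_path):
            parent.remove(item)
            removed += 1
    return removed
```

Calling `remove_by_plugin_id(root, nessusID.strip())` once per line of the blacklist file would mirror the loop in the answer, at the cost of one tree search per ID.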