9

I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new xml file. I've been trying to use iterparse from python's ElementTree, but I'm confused about how to modify the tree and then write the resulting tree into a new XML file. I've read the documentation on itertree but it hasn't cleared things up. Are there any simple ways to do this?

Thank you!

EDIT: Here's what I have so far.

import xml.etree.ElementTree as ET
import re 

date_pages = []
f=open('dates_texts.xml', 'w+')

tree = ET.iterparse("sample.xml")

for i, element in tree:
    if element.tag == 'page':
        for page_element in element:
            if page_element.tag == 'revision':
                for revision_element in page_element:
                    if revision_element.tag == '{text':
                        if len(re.findall('20\d\d', revision_element.text.encode('utf8'))) == 0:
                            element.clear()
LateCoder
  • 2,163
  • 4
  • 25
  • 44
  • Could you show the code from your attempt (even if it's incomplete)? Helping you fix it instead of writing something from scratch would save time. – Asad Saeeduddin Mar 14 '13 at 02:07
  • Added the code to my question, above. – LateCoder Mar 15 '13 at 02:56
  • I spotted that earlier. Sorry, I've been busy with other stuff, but I promise I'll take a look soon. In the meantime, I've brought up your question on chat to bring it some more attention. – Asad Saeeduddin Mar 15 '13 at 02:58
  • How do you know that it doesn't work? Do you get an exception? A good idea is to use a small xml file instead of your 40GB to see if the behaviour is correct, before trying the big file. – wrgrs Mar 15 '13 at 09:43
  • 1
    It's not that the current behavior doesn't work. What I have right now is fine, but it's only a parser. I need a way to write the modified xml back out. – LateCoder Mar 15 '13 at 21:19

2 Answers2

8

If you have a large xml that doesn't fit in memory then you could try to serialize it one element at a time. For example, assuming <root><page/><page/><page/>...</root> document structure and ignoring possible namespace issues:

import xml.etree.cElementTree as etree

def getelements(filename_or_file, tag):
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

with open('output.xml', 'wb') as file:
    # start root
    file.write(b'<root>')

    for page in getelements('sample.xml', 'page'):
        if keep(page):
            file.write(etree.tostring(page, encoding='utf-8'))

    # close root
    file.write(b'</root>')

where keep(page) returns True if page should be kept e.g.:

import re

def keep(page):
    # all <revision> elements must have 20xx in them
    return all(re.search(r'20\d\d', rev.text)
               for rev in page.iterfind('revision'))

For comparison, to modify a small xml file, you could:

# parse small xml
tree = etree.parse('sample.xml')

# remove some root/page elements from xml
root = tree.getroot()
for page in root.findall('page'):
    if not keep(page):
        root.remove(page) # modify inplace

# write to a file modified xml tree
tree.write('output.xml', encoding='utf-8')
jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • Is there a way to get the library to print out `` and `` for you, preserving attributes and, e.g., namespace declarations in the start tag while not keeping the root element in memory? – binki Jan 13 '16 at 21:11
  • @binki: do you see `root` variable in `getelements()`? What do you think it refers to? – jfs Jan 14 '16 at 06:40
  • Just why do you have `file.write(b'')` then? – binki Jan 14 '16 at 06:50
  • @binki: for simplicity (I assume 40GB xml files, do not contain anything useful in ). – jfs Jan 14 '16 at 06:55
  • @gardenhead what does the comment say? – jfs Oct 06 '16 at 22:41
  • @J.F.Sebastian Yes, but why do you need to clear the root on *every* iteration? Shouldn't it just be once, after the whole operation is complete? – gardenhead Oct 06 '16 at 23:45
  • @gardenhead the xml document is assumed to be larger than the available memory – jfs Oct 07 '16 at 00:03
  • @jfs - nice solution, it works a treat. One bit I don't understand (and have checked the docs, but nothing) is `_, root = next(context)`. Could you explain how that works please? I gather that it's setting each 'context'/element as the root for each iterations (is that correct?), but I don't understand what the underscore does. – ron_g Jun 12 '18 at 15:38
  • 1
    @rong `_` is just a throwaway name. `a, b = [1,2]` binds `a` name to `1` int object and `b` name to `2`. – jfs Jun 12 '18 at 18:56
1

Perhaps the answer to my similar question can help you out.

As for how to write this back to an .xml file, I ended up doing this at the bottom of my script:

with open('File.xml', 'w') as t: # I'd suggest using a different file name here than your original
    for line in ET.tostring(doc):
        t.write(line)
    t.close
print('File.xml Complete') # Console message that file wrote successfully, can be omitted

The variable doc is from earlier on in my script, comparable to where you have tree = ET.iterparse("sample.xml") I have this:

doc = ET.parse(filename)

I've been using lxml instead of ElementTree but I think the write out part should still work (I think it's mainly just xpath stuff that ElementTree can't handle.) I'm using lxml imported with this line:

from lxml import etree as ET

Hopefully this (along with my linked question for some additional code context if you need it) can help you out!

Community
  • 1
  • 1
Qanthelas
  • 514
  • 2
  • 8
  • 14
  • 1
    To write `tree = ET.parse(source)` to a file after you've modified it, you could use: `tree.write('File.xml')`. Note: your code `for c in ET.tostring(doc)` writes *one* character at a time. If you want to use `ET.tostring()`; you could write it all at once `t.write(ET.tostring(doc))`. `with` statement closes the file automatically, you don't need `t.close()` inside it. See examples in my answer on [how to write both large and small xml files](http://stackoverflow.com/a/15457389/4279) – jfs Mar 17 '13 at 04:09