
I have to parse a 1 GB XML file with a structure like the one below and extract the text within the "Author" and "Content" tags:

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: (i) reading the whole file and scanning through it with .find(xmltag), and (ii) parsing the XML file with lxml and iterparse(). I got the first option to work, but it is very slow. The second option I haven't managed to get off the ground.

Here's part of what I have:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print(element.text)
    else:
        print('Finished')

The result of that is only blank spaces, with no text in them.

I must be doing something wrong, but I can't grasp it. Also, in case it wasn't obvious, I am quite new to Python and this is the first time I'm using lxml. Please, help!

mvime
  • Well, the `BlogPost` tags don't seem to contain any text in them. – Lev Levitsky Mar 24 '12 at 22:30
  • True. What would be the way to get everything that's between the opening and closing BlogPost tag? – mvime Mar 24 '12 at 22:52
  • If you simply need all the info from inside the `BlogPost` tags, follow andrew's advice. If you want it HTML-formatted, apply `lxml.etree.tostring()` to them. – Lev Levitsky Mar 24 '12 at 22:56

3 Answers

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  for child in element:
    print(child.tag, child.text)
  element.clear()

the final clear will stop you from using too much memory.
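For very large files it can also help to drop references to the already-processed siblings, not just clear the current element (otherwise the emptied elements still accumulate under the root; see the effbot link in the comments below). A hedged sketch of that pattern, using the question's `Author`/`Content` layout (the function name `fast_iter` is my own):

```python
from lxml import etree

def fast_iter(path_to_file):
    for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
        # findtext() returns the text of the first matching child, or None
        print(element.findtext("Author"), element.findtext("Content"))
        element.clear()
        # drop references from the parent to siblings we have already parsed,
        # so the emptied elements can actually be garbage-collected
        while element.getprevious() is not None:
            del element.getparent()[0]
```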

[update:] to get "everything between ... as a string" i guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(etree.tostring(element, encoding='unicode'))
  element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join(etree.tostring(child, encoding='unicode') for child in element))
  element.clear()

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join(child.text or '' for child in element))
  element.clear()
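If the goal is all of the text inside each `BlogPost`, including text in nested elements, `itertext()` is another option. A minimal sketch (the helper name `blogpost_texts` is my own):

```python
from lxml import etree

def blogpost_texts(path_to_file):
    for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
        # itertext() walks every text node under the element in document order
        yield ''.join(element.itertext())
        element.clear()
```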
andrew cooke
  • This works pretty much like I wanted I'll have to customize it a bit, but it's great. Thanks! – mvime Mar 24 '12 at 23:02
  • Is there a way to get everything between starting and ending "BlogPost" tags as a string? – mvime Mar 25 '12 at 00:58
  • @mvime, as what kind of string? In HTML format? Then see my comment above, `lxml.etree.tostring()` does that. You can cut the opening and closing tag off using slice notation (see [this table](http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-bytearray-buffer-xrange)) – Lev Levitsky Mar 25 '12 at 10:13
  • should the `element.close()` be `element.clear()` in the later fragments? so long since i wrote this i no longer remember, but it looks wrong to me. – andrew cooke Dec 10 '12 at 12:01
  • I also had to parse a 1.8 GB XML file, using the same clear() call, but clear() does not actually remove the element from memory, so at the end you end up with a root full of empty elements, which takes memory too. I deleted each element after parsing with the `del` statement, which freed the memory. Read http://effbot.org/zone/element-iterparse.htm#incremental-parsing to see exactly what happens. – Kishor Pawar Mar 17 '15 at 06:29

For future searchers: The top answer here suggests clearing the element on each iteration, but that still leaves you with an ever-increasing set of empty elements that will slowly build up in memory:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  for child in element:
    print(child.tag, child.text)
  element.clear()

^ This is not a scalable solution, especially as your source file gets larger and larger. The better solution is to get the root element, and clear that every time you load a complete record. This will keep memory usage pretty stable (sub-20MB I would say).

Here's a solution that doesn't require looking for a specific tag. This function returns a generator that yields all first-level child nodes (e.g. `<BlogPost>` elements) underneath the root node (e.g. `<Database>`). It does this by recording the start of the first tag after the root node, then waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.

from lxml import etree

xmlfile = '/path/to/xml/file.xml'

def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()
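Conversely, if you do want a specific tag but the nesting depth varies between files, note that `iterparse`'s `tag=` argument matches elements at any depth. A minimal sketch of that variant (the name `iterate_tag` is my own; the caller must read each element before advancing, since it is cleared afterwards):

```python
from lxml import etree

def iterate_tag(xmlfile, tag):
    # tag= matches at any depth, so the same call works whether the
    # layout is source->jobs->job->... or just jobs->job
    for event, element in etree.iterparse(xmlfile, tag=tag):
        yield element  # read it now; it is cleared on the next iteration
        element.clear()
        # also drop references to already-parsed siblings
        while element.getprevious() is not None:
            del element.getparent()[0]
```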
daveruinseverything
  • Well, I quite liked the idea. But what if I need to support multiple file structures? How could I do it without looking for a specific tag? For example, say I have two types of XML file: in one the structure is `source->jobs->job->...`, in another it's `jobs->job`. I want to fetch only the `job` elements. How do I do that with this solution? – Ahsanul Haque Apr 21 '18 at 12:02

I prefer XPath for such things:

In [1]: from lxml.etree import parse

In [2]: tree = parse('/tmp/database.xml')

In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print('Author:', post.xpath('Author')[0].text)
   ...:     print('Content:', post.xpath('Content')[0].text)
   ...: 
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.

I'm not sure if it's different in terms of processing big files, though. Comments about this would be appreciated.

Doing it your way,

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for info in element.iter():
        if info.tag in ('Author', 'Content'):
            print(info.tag, ':', info.text)
Lev Levitsky
  • mm I've simplified the tree a little bit and when I try it it doesn't seem to work. The tag BlogPost for example is not simply '' but '' and the values for Owner and Status change from one entry to the other. – mvime Mar 24 '12 at 22:50
  • Additional attributes won't affect this; only the tree structure matters. To catch all the `BlogPost` elements, you can also use `for post in tree.xpath('//BlogPost'): ...` – Lev Levitsky Mar 24 '12 at 22:58
  • Thanks! I can't vote up yet, but you helped me understand how it works. The answer that I understand better and have gotten to work is Andrew's, though. – mvime Mar 24 '12 at 23:01
  • Thanks @andrew. You have mine, too, mostly for the `clear()` method that I didn't know of. – Lev Levitsky Mar 24 '12 at 23:09
  • I made a comparison recently, and `iterparse` with `clear()` consumes **much** less memory than just `XPath`. – Lev Levitsky Apr 09 '12 at 19:28
  • XPath is very nice, but note that you had to read the entire tree in first with the call to `parse()`. Which doesn't scale well for large files. I have a 3.5 GB XML file I'm working with and `parse()` fails. The `iterparse()` approach still works. – Tom Johnson Jul 05 '22 at 09:00