
I have a 50 MB XML file and I need to read some data out of it. My approach has been to use BeautifulSoup 4, since I have been using that package for some time now. This code shows how I have been doing it:

from bs4 import BeautifulSoup

# since the file is big, this line takes minutes to execute
soup = BeautifulSoup(open('myfile.xml'), 'xml')

items = soup.find_all('item')

for item in items:
    name = item['name']
    status = item.find('status').text
    description = item.find('desc').text
    refs = item.find_all('ref')
    data = []
    for ref in refs:
        if 'url' in ref.attrs:
            data.append('%s:%s' % (ref['source'], ref['url']))
        else:
            data.append('%s:%s' % (ref['source'], ref.text))

    do_something(data)

The XML isn't complicated; I just need to read the data in every <item> entry:

<item type="CVE" name="some-name" seq="1999-0003">
  <status>Entry</status>
  <desc>A description goes here.</desc>
  <refs>
    <ref source="NAI">NAI-29</ref>
    <ref source="CERT">CA-98.11.tooltalk</ref>
    <ref source="SGI" url="example.com">Some data</ref>
    <ref source="XF">aix-ttdbserver</ref>
    <ref source="XF">tooltalk</ref>
  </refs>
</item>

The file I'm using is likely to keep growing, so it would be great to read it in chunks rather than load the whole thing. I need help solving this. Is some other package faster than BS4, and is there a package or technique that avoids loading the whole file into memory?

PepperoniPizza

2 Answers


You want to switch to the xml.etree.ElementTree API here instead; it has an iterparse() function for iterative parsing:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse(source):
    if elem.tag == "record":
        # do something with the <record> element

        elem.clear()  # clean up

Since you are already using BeautifulSoup's XML mode, you must already have lxml installed. lxml implements the same API, but in C. See the lxml iterparse() documentation.

Do read Why is lxml.etree.iterparse() eating up all my memory? to make sure you clear elements properly when using lxml.
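With plain xml.etree.ElementTree, which lacks lxml's getparent()/getprevious() methods used in that question, a common equivalent trick is to grab the root element from the first start event and clear it as you go. A minimal, self-contained sketch with throwaway data:

```python
import io
from xml.etree.ElementTree import iterparse

# Throwaway document: 1000 <item> children under one root.
xml = io.BytesIO(b'<root>' + b'<item>x</item>' * 1000 + b'</root>')

context = iterparse(xml, events=('start', 'end'))
event, root = next(context)  # the first event is the start of the root element

count = 0
for event, elem in context:
    if event == 'end' and elem.tag == 'item':
        count += 1
        root.clear()  # drop completed children so they can be garbage-collected

print(count)
```

Without the root.clear() call the whole tree would still accumulate in memory even though iterparse streams the input.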

By default only end events are emitted, meaning the whole element, including its child nodes, has been parsed by the time you see it. You can make use of this for your <item> elements:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse(source):
    if elem.tag == "item":
        name = elem.get('name')
        status = elem.find('status').text
        desc = elem.find('desc').text
        refs = {r.get('source'): r.text for r in elem.findall('./refs/ref')}
        elem.clear()
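Putting the pieces together for the question's schema, here is a runnable sketch; parse_items and SAMPLE are illustrative names of my own, and the ref handling mirrors the question's url-or-text logic:

```python
import io
from xml.etree.ElementTree import iterparse

# Sample input mirroring the <item> structure from the question.
SAMPLE = b"""<root>
<item type="CVE" name="some-name" seq="1999-0003">
  <status>Entry</status>
  <desc>A description goes here.</desc>
  <refs>
    <ref source="NAI">NAI-29</ref>
    <ref source="SGI" url="example.com">Some data</ref>
  </refs>
</item>
</root>"""

def parse_items(source):
    """Yield (name, status, desc, refs) for each <item>, freeing memory as we go."""
    for event, elem in iterparse(source):  # default: 'end' events only
        if elem.tag == 'item':
            name = elem.get('name')
            status = elem.find('status').text
            desc = elem.find('desc').text
            # Prefer the url attribute when present, else the element text.
            refs = ['%s:%s' % (r.get('source'), r.get('url') or r.text)
                    for r in elem.findall('./refs/ref')]
            yield name, status, desc, refs
            elem.clear()  # drop the children now that we're done with them

for name, status, desc, refs in parse_items(io.BytesIO(SAMPLE)):
    print(name, status, refs)
```

With a real file you would pass `open('myfile.xml', 'rb')` (or just the filename) instead of the BytesIO object.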
Martijn Pieters
  • Depending on the level of parsing you're doing you might also look at [`xml.sax`](https://docs.python.org/2/library/xml.sax.html) which is a streaming parser. You have to keep track of all context, however. – kindall May 20 '14 at 15:08
  • the `iterparse` looks like a good choice, I updated my question. Thanks ! – PepperoniPizza May 20 '14 at 15:26
  • What would be the correct way of opening the file for `source` I tried `io.BytesIO` and is not working. `source = io.BytesIO(open('myfile.xml', 'rb').read())` – PepperoniPizza May 20 '14 at 16:06
  • @PepperoniPizza: just pass in an open file object. `iterparse(open('myfile.xml', 'rb'))`. Or pass in a filename. The documentation does specify this. – Martijn Pieters May 20 '14 at 16:09
  • Yes I know now what is happening: `for event, elem in iterparse(source):` somehow `elem.tag` returns not only the tag name but something like this: `{http://cve.mitre.org/cve/downloads}item`, but I have it working now. Thanks. – PepperoniPizza May 20 '14 at 16:15
  • @MartijnPieters for the sake of freeing memory, shouldn't `elem.clear()` be unindented one level? – PepperoniPizza May 21 '14 at 22:48
  • Not necessarily; we don't want to free child tags until the parent element is done for example. If there are other elements not part of the elements you are looking for, do clear those early and often. – Martijn Pieters May 21 '14 at 22:52
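As the comments above note, when the document declares a default namespace, ElementTree reports tags in Clark notation, i.e. `{uri}localname`, so comparisons must use the namespace-qualified name. A minimal sketch using the namespace quoted in the comment:

```python
import io
from xml.etree.ElementTree import iterparse

# Namespace URI quoted in the comment above, in Clark-notation prefix form.
NS = '{http://cve.mitre.org/cve/downloads}'

doc = io.BytesIO(
    b'<cve xmlns="http://cve.mitre.org/cve/downloads">'
    b'<item name="some-name"/></cve>'
)

names = []
for event, elem in iterparse(doc):
    if elem.tag == NS + 'item':  # compare against the qualified tag name
        names.append(elem.get('name'))  # unprefixed attributes stay unqualified

print(names)
```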

Please have a thorough look at lxml, which is the Swiss Army knife of XML data querying. It can use several parsers, including BeautifulSoup, you can query data using XPath, and it handles many high-level tasks within the XML file.
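lxml supports full XPath; as a rough flavour of attribute-based querying, even the stdlib ElementTree accepts a small XPath subset with `findall()`. A minimal sketch using the question's sample data (lxml's `xpath()` accepts the same expression and much more):

```python
import io
from xml.etree.ElementTree import parse

doc = io.BytesIO(b"""<item name="some-name">
  <refs>
    <ref source="XF">aix-ttdbserver</ref>
    <ref source="XF">tooltalk</ref>
  </refs>
</item>""")

tree = parse(doc)
# [@source='XF'] is an attribute predicate; the stdlib supports this subset.
xf_refs = [r.text for r in tree.findall(".//ref[@source='XF']")]
print(xf_refs)
```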

Here's a way to parse big files, from the documentation:

from lxml import etree

with open('xmlfile.xml', 'rb') as f:
    for event, element in etree.iterparse(f, events=("start", "end")):
        print("%5s, %4s, %s" % (event, element.tag, element.text))

The documentation also shows many other ways to interact with the parser, including generating only subtrees.

zmo