
I have a 50 MB XML file and I need to read some data out of it. My approach has been to use BeautifulSoup 4, since I have been using that package for some time now. This code shows how I have been doing it:

from bs4 import BeautifulSoup

# since the file is big, this line takes minutes to execute
soup = BeautifulSoup(open('myfile.xml'), 'xml')

items = soup.find_all('item')

for item in items:
    name = item['name']
    status = item.find('status').text
    description = item.find('desc').text
    refs = item.find_all('ref')
    data = []
    for ref in refs:
        if 'url' in ref.attrs:
            data.append('%s:%s' % (ref['source'], ref['url']))
        else:
            data.append('%s:%s' % (ref['source'], ref.text))

    do_something(data)

The XML isn't complicated; I just need to read the data in every <item> entry:

<item type="CVE" name="some-name" seq="1999-0003">
  <status>Entry</status>
  <desc>A description goes here.</desc>
  <refs>
    <ref source="NAI">NAI-29</ref>
    <ref source="CERT">CA-98.11.tooltalk</ref>
    <ref source="SGI" url="example.com">Some data</ref>
    <ref source="XF">aix-ttdbserver</ref>
    <ref source="XF">tooltalk</ref>
  </refs>
</item>

The file I'm using is likely to keep growing, so it would be great to read it in chunks rather than load the whole thing. I need help solving this. Is some other package faster than BS4, and is there a package or technique that avoids loading the whole file into memory?

PepperoniPizza

2 Answers


You want to switch to the xml.etree.ElementTree API here instead; it has an iterparse() function for iterative parsing:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse(source):
    if elem.tag == "record":
        # do something with the <record> element

        elem.clear()  # clean up

Since you are already using BeautifulSoup's XML mode, you must already have lxml installed. lxml implements the same API, but in C. See the lxml iterparse() documentation.

Do read Why is lxml.etree.iterparse() eating up all my memory? to make sure you clear elements properly when using lxml.
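With plain xml.etree.ElementTree, which lacks lxml's getparent()/getprevious() methods used in that question, a common equivalent trick is to grab the root element from the first start event and clear it as you go. A minimal, self-contained sketch with throwaway data:

```python
import io
from xml.etree.ElementTree import iterparse

# Throwaway document: 1000 <item> children under one root.
xml = io.BytesIO(b'<root>' + b'<item>x</item>' * 1000 + b'</root>')

context = iterparse(xml, events=('start', 'end'))
event, root = next(context)  # the first event is the start of the root element

count = 0
for event, elem in context:
    if event == 'end' and elem.tag == 'item':
        count += 1
        root.clear()  # drop completed children so they can be garbage-collected

print(count)
```

Without the root.clear() call the whole tree would still accumulate in memory even though iterparse streams the input.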

By default only end events are emitted, meaning the whole element, including its child nodes, has been parsed by the time you see it. You can make use of this for your <item> elements:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse(source):
    if elem.tag == "item":
        name = elem.get('name')
        status = elem.find('status').text
        desc = elem.find('desc').text
        refs = {r.get('source'): r.text for r in elem.findall('./refs/ref')}
        elem.clear()
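Putting the pieces together for the question's schema, here is a runnable sketch; parse_items and SAMPLE are illustrative names of my own, and the ref handling mirrors the question's url-or-text logic:

```python
import io
from xml.etree.ElementTree import iterparse

# Sample input mirroring the <item> structure from the question.
SAMPLE = b"""<root>
<item type="CVE" name="some-name" seq="1999-0003">
  <status>Entry</status>
  <desc>A description goes here.</desc>
  <refs>
    <ref source="NAI">NAI-29</ref>
    <ref source="SGI" url="example.com">Some data</ref>
  </refs>
</item>
</root>"""

def parse_items(source):
    """Yield (name, status, desc, refs) for each <item>, freeing memory as we go."""
    for event, elem in iterparse(source):  # default: 'end' events only
        if elem.tag == 'item':
            name = elem.get('name')
            status = elem.find('status').text
            desc = elem.find('desc').text
            # Prefer the url attribute when present, else the element text.
            refs = ['%s:%s' % (r.get('source'), r.get('url') or r.text)
                    for r in elem.findall('./refs/ref')]
            yield name, status, desc, refs
            elem.clear()  # drop the children now that we're done with them

for name, status, desc, refs in parse_items(io.BytesIO(SAMPLE)):
    print(name, status, refs)
```

With a real file you would pass `open('myfile.xml', 'rb')` (or just the filename) instead of the BytesIO object.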
Martijn Pieters
  • Depending on the level of parsing you're doing you might also look at [`xml.sax`](https://docs.python.org/2/library/xml.sax.html) which is a streaming parser. You have to keep track of all context, however. – kindall May 20 '14 at 15:08
  • the `iterparse` looks like a good choice, I updated my question. Thanks ! – PepperoniPizza May 20 '14 at 15:26
  • What would be the correct way of opening the file for `source` I tried `io.BytesIO` and is not working. `source = io.BytesIO(open('myfile.xml', 'rb').read())` – PepperoniPizza May 20 '14 at 16:06
  • @PepperoniPizza: just pass in an open file object. `iterparse(open('myfile.xml', 'rb'))`. Or pass in a filename. The documentation does specify this. – Martijn Pieters May 20 '14 at 16:09
  • Yes I know now what is happening: `for event, elem in iterparse(source):` somehow `elem.tag` returns not only the tag name but something like this: `{http://cve.mitre.org/cve/downloads}item`, but I have it working now. Thanks. – PepperoniPizza May 20 '14 at 16:15
  • @MartijnPieters for the sake of freeing memory, shouldn't `elem.clear()` be unindented one level? – PepperoniPizza May 21 '14 at 22:48
  • Not necessarily; we don't want to free child tags until the parent element is done for example. If there are other elements not part of the elements you are looking for, do clear those early and often. – Martijn Pieters May 21 '14 at 22:52
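As the comments above note, when the document declares a default namespace, ElementTree reports tags in Clark notation, i.e. `{uri}localname`, so comparisons must use the namespace-qualified name. A minimal sketch using the namespace quoted in the comment:

```python
import io
from xml.etree.ElementTree import iterparse

# Namespace URI quoted in the comment above, in Clark-notation prefix form.
NS = '{http://cve.mitre.org/cve/downloads}'

doc = io.BytesIO(
    b'<cve xmlns="http://cve.mitre.org/cve/downloads">'
    b'<item name="some-name"/></cve>'
)

names = []
for event, elem in iterparse(doc):
    if elem.tag == NS + 'item':  # compare against the qualified tag name
        names.append(elem.get('name'))  # unprefixed attributes stay unqualified

print(names)
```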

Please have a thorough look at lxml, which is the Swiss Army knife of XML data querying. It can use several parsers, including BeautifulSoup, you can query data using XPath, and it handles many high-level tasks within the XML file.
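lxml supports full XPath; as a rough flavour of attribute-based querying, even the stdlib ElementTree accepts a small XPath subset with `findall()`. A minimal sketch using the question's sample data (lxml's `xpath()` accepts the same expression and much more):

```python
import io
from xml.etree.ElementTree import parse

doc = io.BytesIO(b"""<item name="some-name">
  <refs>
    <ref source="XF">aix-ttdbserver</ref>
    <ref source="XF">tooltalk</ref>
  </refs>
</item>""")

tree = parse(doc)
# [@source='XF'] is an attribute predicate; the stdlib supports this subset.
xf_refs = [r.text for r in tree.findall(".//ref[@source='XF']")]
print(xf_refs)
```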

Here's a way to parse big files, from the documentation:

from lxml import etree

with open('xmlfile.xml', 'rb') as f:
    for event, element in etree.iterparse(f, events=("start", "end")):
        print("%5s, %4s, %s" % (event, element.tag, element.text))

The documentation also shows many other ways to interact with the parser, including generating only subtrees.

zmo