I have a 50MB XML file and I need to read some data out of it. My approach was to use BeautifulSoup 4 since I have been using that package for some time now. This code shows how I have been doing it:
from bs4 import BeautifulSoup

# since the file is big, this line takes minutes to execute
soup = BeautifulSoup(open('myfile.xml'), 'xml')
items = soup.find_all('item')
for item in items:
    name = item['name']
    status = item.find('status').text
    description = item.find('desc').text
    refs = item.find_all('ref')
    data = []
    for ref in refs:
        if 'url' in ref.attrs:
            data.append('%s:%s' % (ref['source'], ref['url']))
        else:
            data.append('%s:%s' % (ref['source'], ref.text))
    do_something(data)
The file isn't complicated XML; I just need to read the data from every <item> entry:
<item type="CVE" name="some-name" seq="1999-0003">
<status>Entry</status>
<desc>A description goes here.</desc>
<refs>
<ref source="NAI">NAI-29</ref>
<ref source="CERT">CA-98.11.tooltalk</ref>
<ref source="SGI" url="example.com">Some data</ref>
<ref source="XF">aix-ttdbserver</ref>
<ref source="XF">tooltalk</ref>
</refs>
</item>
The file is likely to keep growing, so it would be great to read it in chunks or avoid loading the whole thing at once. I need help solving this. Is there some other package that is faster than BS4, or some other way of parsing the file without loading it all into memory?
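
For reference, this is the kind of streaming approach I had in mind using the standard library's xml.etree.ElementTree.iterparse (just a rough, untested sketch that assumes the same tag and attribute names as my sample above and reuses my do_something helper), in case it helps frame the question:

import xml.etree.ElementTree as ET

def parse_items(path):
    # iterparse yields each element as soon as its closing tag is read,
    # so the whole document never has to be held in memory at once
    for event, elem in ET.iterparse(path, events=('end',)):
        if elem.tag != 'item':
            continue
        name = elem.get('name')
        status = elem.findtext('status')
        description = elem.findtext('desc')
        data = []
        for ref in elem.iter('ref'):
            if 'url' in ref.attrib:
                data.append('%s:%s' % (ref.get('source'), ref.get('url')))
            else:
                data.append('%s:%s' % (ref.get('source'), ref.text))
        do_something(data)
        # discard the element's children to keep memory usage flat
        elem.clear()

Would something like this (or an lxml equivalent) be the recommended way to handle a file this size?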