
I'm trying to parse the following feed into ElementTree in python: "http://smarkets.s3.amazonaws.com/oddsfeed.xml" (warning large file)

Here is what I have tried so far:

import urllib
import StringIO
import gzip
import xml.etree.ElementTree as ET

feed = urllib.urlopen("http://smarkets.s3.amazonaws.com/oddsfeed.xml")

# feed is compressed
compressed_data = feed.read()
compressedstream = StringIO.StringIO(compressed_data)
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()

# Parse XML
tree = ET.parse(data)

but it seems to just hang on compressed_data = feed.read(), maybe indefinitely? (I know it's a big file, but it takes far longer than other, uncompressed feeds I've parsed, and a delay this long kills any bandwidth gains from the gzip compression in the first place.)

Next I tried requests, with

url = "http://smarkets.s3.amazonaws.com/oddsfeed.xml"
headers = {'accept-encoding': 'gzip, deflate'}
r = requests.get(url, headers=headers, stream=True)

but now

tree=ET.parse(r.content)

or

tree=ET.parse(r.text)

but both of these raise exceptions.

What's the proper way to do this?

fpghost

2 Answers


You can pass the file-like object returned by urlopen() directly to GzipFile(), and in turn pass that to ElementTree methods such as iterparse():

#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request

with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)

where getelements() allows parsing files that do not fit in memory:

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementaly."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear() # free memory

To keep memory usage bounded, the partially built XML tree is cleared after each *tag* element is yielded.
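For example, here is a self-contained sketch of the same pipeline, using a small in-memory gzipped document in place of the real feed (the `event` tag and `id` attribute are made up for illustration; the actual feed uses different element names):

```python
import gzip
import io
import xml.etree.ElementTree as etree

def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory

# Build a small gzipped XML document in memory to stand in for the feed
xml_bytes = b"<feed><event id='1'/><event id='2'/><other/></feed>"
buf = io.BytesIO(gzip.compress(xml_bytes))

with gzip.GzipFile(fileobj=buf) as xml_file:
    ids = [elem.get('id') for elem in getelements(xml_file, 'event')]

print(ids)  # ['1', '2']
```

The GzipFile object behaves like an ordinary file, so iterparse() reads and decompresses it lazily; only one `event` element (plus the cleared root) is held in memory at a time.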

jfs
  • Could you elaborate a bit on `root.clear()` ? How does this free the memory for each `elem` ? – Mr_and_Mrs_D Sep 26 '18 at 19:27
  • @Mr_and_Mrs_D the last sentence says what `root.clear()` does and why it is used. – jfs Sep 26 '18 at 19:29
  • Thanks - what confuses me is that root seems defined once - I would expect something like `elem.clear()` – Mr_and_Mrs_D Sep 26 '18 at 19:42
  • @Mr_and_Mrs_D: here's [on the difference between `elem.clear()` and `root.clear()`](http://effbot.org/zone/element-iterparse.htm#incremental-parsing) (in short: unless the xml file is huge, we can ignore it). Here's [on the performance difference](https://stackoverflow.com/a/7943376/4279) – jfs Sep 26 '18 at 19:50

The ET.parse function takes "a filename or file object containing XML data". You're giving it a string full of XML. It's going to try to open a file whose name is that big chunk of XML. There is probably no such file.

You want the fromstring function, or the XML constructor.

Or, if you prefer, you've already got a file object, gzipper; you could just pass that to parse instead of reading it into a string.
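A minimal sketch of that last approach, using an in-memory gzipped document as a stand-in for the downloaded feed (the `odds` element name is invented for illustration):

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Stand-in for the compressed response body you would get from the server
compressed = io.BytesIO(gzip.compress(b"<odds><market/></odds>"))

gzipper = gzip.GzipFile(fileobj=compressed)
tree = ET.parse(gzipper)  # parse() accepts a file object directly
root = tree.getroot()
print(root.tag)  # odds
```

Skipping the intermediate .read() also means the decompressed XML is streamed into the parser instead of being materialized as one giant string first.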


This is all covered by the short Tutorial in the docs:

We can import this data by reading from a file:

import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()

Or directly from a string:

root = ET.fromstring(country_data_as_string)
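Applied to the requests attempt from the question, that would look roughly like the following (stand-in bytes replace `r.content` here, since requests returns the response body as bytes and fromstring accepts bytes as well as str):

```python
import xml.etree.ElementTree as ET

# Stand-in for r.content from the requests example; element names are invented
content = b"<odds><event id='1'/></odds>"
root = ET.fromstring(content)  # parses the data itself, not a filename
print(root.tag, root[0].get('id'))  # odds 1
```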
abarnert