I have a SOAP client in Python that receives a response in which one element of the SOAP envelope's body carries a large stream of data (a gzipped file of several GBs; the machine's main memory is not necessarily big enough to hold it).

Thus, I have to process the information as a stream, which I specify when posting my SOAP request:

import requests
url      = 'xyz.svc'
headers  = {'SOAPAction': X}  # X and payload elided here
response = requests.post(url, data=payload, headers=headers, stream=True)

To parse the information with lxml.etree (I first have to read some information from the header and then process the fields from the body, including the large file element), I now want to feed the stream to iterparse:

from lxml import etree
context = etree.iterparse(response.raw, events=('start', 'end'))
for event, elem in context:
    if event == 'start':
        if elem.tag == t_header:
            pass  # process header
        elif elem.tag == t_body:
            pass  # TODO: write element text to file, rely on etree to not load into memory?
    else:
        pass  # do some cleanup

Unfortunately, structuring the code like this does not work; passing response.raw to iterparse raises:

XMLSyntaxError: Document is empty, line 1, column 1
  1. How can this be fixed? Searching for similar solutions, this approach generally seems to work for people.

Solution: etree received a still-compressed byte stream, which it did not handle properly; setting

response.raw.decode_content = True

seems to work for the first part.
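
A quick way to verify what the flag changes (reusing url, payload, and headers from above; not tested against the actual service) is to peek at the first bytes of the stream:

import requests
response = requests.post(url, data=payload, headers=headers, stream=True)
# Without the flag, response.raw yields the wire bytes; a gzipped response
# starts with the gzip magic number b'\x1f\x8b', which is why iterparse
# sees no XML and reports "Document is empty".
response.raw.decode_content = True
print(response.raw.read(32))  # should now begin with the XML prolog, e.g. b'<?xml'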

  2. How do I properly implement streaming of the element text? Will etree.iterparse load the full element.text into memory, or can it be read in chunks?

Update: With the parser working (I am not sure, however, whether the decoding is done properly, since the files end up corrupted, probably as a result of bad base64 decoding), this still remains to be implemented properly.

Update 2: Some additional debugging showed that it is not possible to access all the content in the large element's text. Printing

len(elem.text)

in the start event shows 33651, while in the end event it is 10389304. Any ideas on how to read the full content iteratively with lxml?
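
For anyone hitting the same wall: a minimal sketch of a fully streaming alternative, using lxml's target parser interface, where character data arrives through repeated data() callbacks instead of one accumulated elem.text. The tag name, output path, and chunk size below are placeholders, and incremental base64 decoding of the extracted text is left out:

from lxml import etree

class LargeTextTarget:
    """Parser target that streams one element's character data to a file."""
    def __init__(self, tag, out):
        self.tag = tag
        self.out = out
        self.active = False

    def start(self, tag, attrib):
        if tag == self.tag:
            self.active = True

    def end(self, tag):
        if tag == self.tag:
            self.active = False

    def data(self, text):
        # libxml2 delivers large text nodes in several chunks, so the
        # element's full text is never held in memory at once
        if self.active:
            self.out.write(text)

    def close(self):
        self.active = False

# assumes response.raw.decode_content = True, as above
with open('large_element.b64', 'w') as out:                          # placeholder path
    parser = etree.XMLParser(
        target=LargeTextTarget('{http://example.com/ns}File', out)  # placeholder tag
    )
    for chunk in iter(lambda: response.raw.read(64 * 1024), b''):
        parser.feed(chunk)
    parser.close()

Feeding the decoded HTTP stream in fixed-size chunks keeps memory usage bounded no matter how large the element's text grows.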

Regarding package versions, I am using:

  • requests 2.9.1
  • python 3.4.4
  • etree 3.4.4 (from lxml)
  • What happens if you use the `end` event instead of the `start` event? – DocZerø Jul 22 '16 at 14:25
  • That unfortunately did not change anything (other than that, I believe if it worked, it would read the entire content of the element in question, which would be too much to keep in memory). At least I managed to get the parsing working by setting `response.raw.decode_content = True`. – sim Jul 22 '16 at 15:39
  • Well, if it is gzipped data, then you won't be reading it with iterparse. – Padraic Cunningham Jul 22 '16 at 18:00
  • Also happy with any alternative suggestions that allow for reasonably robust processing without loading the full XML element content into memory. Would SAX work? I have never used this myself. Maybe the best solution would be to just write the parser myself, but pure Python IO tends to be a bit slow. – sim Jul 22 '16 at 23:23
  • There's a solution for this at https://stackoverflow.com/questions/52989143/fully-streaming-xml-parser – Erik Cederstrand Nov 27 '18 at 15:42

1 Answer

You need to enable decoding on the raw stream.

This works:

import requests
from lxml import etree

def fast_iter(url):
    response = requests.post(url, stream=True)
    response.raw.decode_content = True  # gunzip/inflate while reading
    context = etree.iterparse(response.raw, html=True, events=('end',))
    for event, elem in context:
        if event == 'end':
            yield elem.text
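
A hypothetical call site (the URL is a placeholder):

for text in fast_iter('https://example.com/xyz.svc'):
    if text:
        print(len(text))

Note that each elem.text is still assembled in full before being yielded, so this streams the HTTP response rather than the text of any single element.
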
– Jose G