I have a SOAP client in Python that receives a response which, in one element of the SOAP envelope's body, carries a large stream of data (a gzipped file of several GB; the machine's main memory is not necessarily big enough to hold it).
Thus, I am required to process the information as a stream, which I am specifying when posting my SOAP request:
import requests
url = 'xyz.svc'
headers = {'SOAPAction': X}
response = requests.post(url, data=payload, headers=headers, stream=True)
In order to parse the information with lxml.etree (I first have to read some information from the header and then process the fields from the body, including the large file element), I now want to use the stream to feed iterparse:
from lxml import etree
context = etree.iterparse(response.raw, events=('start', 'end'))
for event, elem in context:
    if event == 'start':
        if elem.tag == t_header:
            ...  # process header
        elif elem.tag == t_body:
            ...  # TODO: write element text to file, rely on etree to not load into memory?
    else:
        ...  # do some cleanup
Unfortunately, structuring the code like this does not work: passing response.raw to iterparse raises
XMLSyntaxError: Document is empty, line 1, column 1
- How can this be fixed? Judging by similar questions, this approach generally seems to work for other people.
Solution: etree received a still-encoded (gzipped) byte stream, which it could not handle properly. Setting
response.raw.decode_content = True
makes urllib3 decode the stream according to the Content-Encoding header and fixes the first part.
- How do I properly implement streaming the element text? Will etree.iterparse load the full element.text into memory, or can it be read in chunks?
Update: With the parser working, this remains to be properly implemented. (I am not sure, however, whether the decoding is done correctly, since the files end up corrupted, probably as a result of bad base64 decoding.)
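If the corruption really comes from the base64 step, one common pitfall is decoding at arbitrary chunk boundaries: base64 can only be decoded safely in units of 4 characters, so any leftover must be buffered until the next chunk arrives. A minimal sketch of an incremental decoder (the function name and chunk handling are illustrative, not part of my actual code):

```python
import base64

def iter_decoded(chunks):
    """Decode base64 text arriving in arbitrary-sized chunks.

    Buffers leftover characters so that every call to b64decode
    receives a length that is a multiple of 4.
    """
    buf = b''
    for chunk in chunks:
        if isinstance(chunk, str):
            chunk = chunk.encode('ascii')
        buf += chunk
        # Only a multiple of 4 characters can be decoded safely.
        usable = len(buf) - (len(buf) % 4)
        if usable:
            yield base64.b64decode(buf[:usable])
            buf = buf[usable:]
    if buf:
        # Trailing characters; a well-formed stream is padded to a
        # multiple of 4, so this decodes (or raises) at end of input.
        yield base64.b64decode(buf)
```

The decoded pieces can then be written straight to a file without ever holding the whole payload in memory.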
Update 2: Some additional debugging showed that it is not possible to access all of the content in the large element's text: printing len(elem.text) in the start event shows 33651, while in the end event it is 10389304. Any ideas how to read the full content iteratively with lxml?
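As I understand it, iterparse only exposes the text seen so far at the start event and the full accumulated text at the end event, so it does not help with streaming a single huge text node. One alternative appears to be lxml's target parser interface, where the parser invokes a data() callback repeatedly as character data is parsed, so the text can be forwarded to a file as it arrives. A sketch under the assumption that the large element's tag is known (the class, tag, and sink names here are illustrative):

```python
from lxml import etree

class LargeTextTarget:
    """Parser target that streams the text of one element to a sink
    (any object with a write() method) instead of accumulating it."""

    def __init__(self, tag, sink):
        self.tag = tag        # tag of the large element (illustrative)
        self.sink = sink
        self.inside = False

    def start(self, tag, attrib):
        if tag == self.tag:
            self.inside = True

    def end(self, tag):
        if tag == self.tag:
            self.inside = False

    def data(self, text):
        # Called repeatedly with pieces of character data as they
        # are parsed, so the full text is never held in memory here.
        if self.inside:
            self.sink.write(text)

    def close(self):
        return None

# Feeding the parser incrementally, e.g. from the streamed response
# (tag name and chunk size are assumptions for illustration):
# target = LargeTextTarget('{ns}Data', open('out.b64', 'w'))
# parser = etree.XMLParser(target=target)
# for chunk in response.iter_content(chunk_size=65536):
#     parser.feed(chunk)
# parser.close()
```

The trade-off is that with a target parser no tree is built, so the header fields would also have to be captured in the start()/data() callbacks rather than read from element objects afterwards.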
Regarding package versions, I am using:
- requests 2.9.1
- python 3.4.4
- etree 3.4.4 (from lxml)