Parse XML streaming using XMLPullParser in Python

Question

This is my first time asking a question here, and I'm a newbie, so I'm sorry if my question sounds stupid to some. I am working on streaming data from a machine:

import requests

r = requests.get('http://IP:port/sample?interval=0&heartbeat=1000', stream=True)

and I am receiving the data as XML. This is the structure of the XML data:

b'--9bc1ad19bf9e3b4049ab7e4f78dda451'
b'Content-type: text/xml'
b'Content-length: 15560'
b'<?xml version="1.0" encoding="UTF-8"?>'
b'<MTConnectStreams xmlns:m="urn:mtconnect.org:MTConnectStreams:1.3"  xmlns="urn:mtconnect.org:MTConnectStreams:1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:mtconnect.org:MTConnectStreams:1.3 http://www.mtconnect.org/schemas/MTConnectStreams_1.3.xsd">'
b'  <Header creationTime="2016-12-01T17:58:48Z" sender="MAZATROL-PC" instanceId="1480604825" version="1.3.0.17" bufferSize="131072" nextSequence="1301" firstSequence="1" lastSequence="42044"/>'
b'  <Streams>'
b'    <DeviceStream name="Mazak" uuid="Mazak">'
b'      <ComponentStream component="Controller" name="controller" componentId="cont">'
b'        <Samples>'
b'          <AccumulatedTime dataItemId="yltime" timestamp="2016-12-01T15:45:15.662995Z" name="total_time" sequence="1214" subType="x:TOTAL">3104040</AccumulatedTime>'
b'          <AccumulatedTime dataItemId="yltime" timestamp="2016-12-01T15:46:16.452858Z" name="total_time" sequence="1243" subType="x:TOTAL">3104101</AccumulatedTime>'
b'          <AccumulatedTime dataItemId="yltime" timestamp="2016-12-01T15:47:17.331808Z" name="total_time" sequence="1272" subType="x:TOTAL">3104162</AccumulatedTime>'
b'          <PathFeedrateOverride dataItemId="pfo" timestamp="2016-12-01T15:33:27.042482Z" name="Fovr" sequence="899" subType="ACTUAL">0</PathFeedrateOverride>'
b'          <PathFeedrateOverride dataItemId="pfr" timestamp="2016-12-01T15:30:26.700817Z" name="Frapidovr" sequence="803" subType="RAPID">0</PathFeedrateOverride>'
b'          <PathFeedrateOverride dataItemId="pfr" timestamp="2016-12-01T15:30:42.685031Z" name="Frapidovr" sequence="810" subType="RAPID">0</PathFeedrateOverride>'

I am only interested in getting some information from the lines that contain dataItemId. To print the data, I did this:

for line in r.iter_lines():
    if b'dataItemId' in line:
        print(line)

Speed is crucial, since we want real-time data accessible in an AWS database, and I am lost on the best way to parse the stream. From what I found, XMLPullParser is the best way to parse streaming data without blocking. However, I don't know what the 'start' and 'end' events should be, or how to proceed without losing data while guaranteeing that everything gets parsed.

I was thinking of having one thread that receives the data and another that parses it with XMLPullParser; once the data has been converted to JSON and sent, the line is deleted from the tree. But if I only parse the lines that have dataItemId, I don't have a tree structure with child nodes, so I can't see clearly how this should work. Your help is highly appreciated. Thank you.

You're looking to pull only a certain tag. What do you mean what the 'start' and 'end' be? Do you know what appears when you should end parsing? ie can you mark when your file ends? — themistoklik, Dec 01 '16 at 20:33
As we're streaming, we're collecting the data as long as the machine is running. So unless the machine is off, I'm keeping on receiving data and parsing it. I was referring to this : parser = etree.XMLPullParser(events=('start', 'end')) — Wafa, Dec 01 '16 at 21:24
You tried start event to be the tag you wanted and end to be empty string and failed? What happens if you set those two as events? Also have you heard of SAX parsing? — themistoklik, Dec 01 '16 at 22:07
dataItemId is not a tag right ? From what I understand is that a tag is what appears after < . Correct me if I'm wrong. I will try to do what you suggested, but the tag can change since the server sends data only when it has new one. Sure what I'm sure of is that there should be dataItemId. How can I put this as start event ? — Wafa, Dec 01 '16 at 22:27
As the tag you wanted I mean for example if you can be sure that all items you want will fall under that tag. It's always the same format right? For an end tag I'd choose one that always appears after the content I want. You also mention deletions from a tree. Do you want to keep the whole XML in memory or just extract the data you want? — themistoklik, Dec 01 '16 at 22:43
The two tags I'm sure contain the data I need are either sample or event. I don't know if it's possible to use or as a start event. — Wafa, Dec 01 '16 at 22:45
As it's streaming, storing everything might not be the best option. so once the data is processed which means sent to the cloud database, there is no need for keeping it. — Wafa, Dec 01 '16 at 22:46

score 1 · Answer 1 · edited May 23 '17 at 12:13

In lieu of an answer using the library you want, let me point you in a similar direction. Since you're fishing for two specific tags, a simple approach would be like the one in this post, only your check would have to be

if element.tag == "tag1" or element.tag == "tag2":
    ...  # handle the element here
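
If you do want to stay with XMLPullParser, the same kind of check can be sketched roughly as below. This is only a sketch under a few assumptions: the function name iter_data_items is made up for illustration, every multipart chunk is assumed to be a complete XML document that starts with an `<?xml` line, and the check is on the dataItemId attribute rather than on fixed tag names, since the question says the tags vary. Only the 'end' event is requested, because an element's text is complete only once its closing tag has been seen.

```python
import xml.etree.ElementTree as ET


def iter_data_items(lines):
    """Yield (tag, attributes, text) for each element that has dataItemId.

    `lines` is any iterable of bytes lines, for example r.iter_lines().
    (Hypothetical helper name, not part of any library.)
    """
    parser = None
    for line in lines:
        if line.startswith(b"<?xml"):
            # Assumption: each multipart chunk is its own XML document,
            # so a fresh parser is started at every XML declaration.
            parser = ET.XMLPullParser(events=("end",))
        elif parser is None or line.startswith((b"--", b"Content-")):
            # Skip multipart boundaries and headers between documents.
            continue
        parser.feed(line + b"\n")
        # flush() only exists on newer Python versions; where present it
        # asks Expat to parse whatever it has buffered right away.
        flush = getattr(parser, "flush", None)
        if flush is not None:
            flush()
        for _event, elem in parser.read_events():
            if "dataItemId" in elem.attrib:
                tag = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
                yield tag, dict(elem.attrib), elem.text
            elem.clear()  # release the element's contents once handled
```

With the request from the question, usage would look like `for tag, attrs, text in iter_data_items(r.iter_lines()): ...`, and each tuple can be converted to JSON and sent on before the next one is parsed.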

You could also check SAX and follow the same logic. If you're doing this with speed in mind, profile both and keep the implementation that best suits your needs in terms of speed and memory.
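
For comparison, a SAX version of the same logic might look something like this. Again this is a sketch, not a tested production setup: DataItemHandler, parse_document and the on_item callback are names invented here, and the data items are assumed to be leaf elements (no nested elements that also carry dataItemId), as in the sample from the question. The standard library's SAX parser supports incremental feeding, so nothing is kept in memory beyond the element currently being read.

```python
import xml.sax


class DataItemHandler(xml.sax.ContentHandler):
    """Calls on_item(tag, attrs, text) for each element with dataItemId."""

    def __init__(self, on_item):
        super().__init__()
        self.on_item = on_item
        self._current = None  # (tag, attrs) of the data item being read
        self._text = []

    def startElement(self, name, attrs):
        if "dataItemId" in attrs:
            self._current = (name, dict(attrs.items()))
            self._text = []

    def characters(self, content):
        # Text can arrive in several pieces, so collect and join later.
        if self._current is not None:
            self._text.append(content)

    def endElement(self, name):
        if self._current is not None and name == self._current[0]:
            tag, attrs = self._current
            self.on_item(tag, attrs, "".join(self._text))
            self._current = None


def parse_document(chunks, on_item):
    """Feed one XML document, given as an iterable of bytes chunks."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(DataItemHandler(on_item))
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
```

As with the pull parser, the multipart boundary and header lines would have to be stripped before feeding, and a new parser is needed for each document in the stream.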

Also see this post
