Transforming XML response into pandas dataframe

Question

Been trying to structure an API response from XML to a readable pandas dataframe. I found a lot of inspiration in earlier threads on this topic, but the values in my dataframe still displays as "None".

The XML response:

<VehiclePositionResponse xmlns="http://fms-standard.com/rfms/v1.0.0/xsd/position" xmlns:ns2="http://fms-standard.com/rfms/v1.0.0/xsd/common/position">
    <VehiclePosition>
        <VIN>YS2R8X40005440923</VIN>
        <TriggerType>OTHER</TriggerType>
        <CreatedDateTime>2019-07-31T16:50:28</CreatedDateTime>
        <ReceivedDateTime>2019-07-31T16:50:29</ReceivedDateTime>
        <GNSSPosition>
            <ns2:Latitude>62.098339</ns2:Latitude>
            <ns2:Longitude>10.542222</ns2:Longitude>
            <ns2:Heading>291</ns2:Heading>
            <ns2:Altitude>655</ns2:Altitude>
            <ns2:Speed>0</ns2:Speed>
            <ns2:PositionDateTime>2019-07-31T16:50:28</ns2:PositionDateTime>
        </GNSSPosition>
        <WheelBasedSpeed></WheelBasedSpeed>
    </VehiclePosition>
    <VehiclePosition>
        <VIN>YS2R8X40005441367</VIN>
        <TriggerType>OTHER</TriggerType>
        <CreatedDateTime>2019-07-31T18:13:24</CreatedDateTime>
        <ReceivedDateTime>2019-07-31T18:13:25</ReceivedDateTime>
        <GNSSPosition>
            <ns2:Latitude>62.127206</ns2:Latitude>
            <ns2:Longitude>10.608676</ns2:Longitude>
            <ns2:Heading>3</ns2:Heading>

etc.

Code:

headers={'Authorization':Token,'Content-Type':'application/xml'}
r=requests.get(url, headers=headers)


def getvalueofnode(node):
    return node.text if node is not None else None

def main():
   root = cET.fromstring(r.content)
   dfcols = ['VIN', 'CreatedDateTime', 'ReceivedDateTime', 'Latitude', 'Longitude', 'Altitude']
   df_xml = pd.DataFrame(columns=dfcols)

   for node in root:
      VIN = node.find('VIN')
      CreatedDateTime = node.find('CreatedDateTime')
      ReceivedDateTime = node.find('ReceivedDateTime')
      Latitude = node.find('Latitude')
      Longitude = node.find('Longitude')
      Altitude = node.find('Altitude')

      df_xml = df_xml.append(
         pd.Series([getvalueofnode(VIN), getvalueofnode(CreatedDateTime), getvalueofnode(ReceivedDateTime), getvalueofnode(Latitude), getvalueofnode(Longitude), getvalueofnode(Altitude)], index=dfcols),
      ignore_index=True)

   print(df_xml)

main()

This is how my response looks like: Current Dataframe

score 1 · Accepted Answer · answered Aug 02 '19 at 03:39

Essentially, you are not accounting for the namespaces in the XML located in root tag and likely the reason for all None results. Consider parsing with namespaces defined. Since one is the default namespace, give it any prefix like data and parse with it:

ns = {"doc":"http://fms-standard.com/rfms/v1.0.0/xsd/position",
      "ns2":"http://fms-standard.com/rfms/v1.0.0/xsd/common/position"}

for node in root:
   VIN = node.find("doc:VIN", ns)
   CreatedDateTime = node.find('doc:CreatedDateTime', ns)
   ReceivedDateTime = node.find('doc:ReceivedDateTime', ns)
   Latitude = node.find('doc:GNSSPosition/ns2:Latitude', ns)
   Longitude = node.find('doc:GNSSPosition/ns2:Longitude', ns)
   Altitude = node.find('doc:GNSSPosition/ns2:Altitude', ns)

Additionally, avoid the quadratic copy of calling append in a loop. Instead, build a list of dictionaries to bind into a DataFrame() constructor.

def main2():
   root = cET.fromstring(r.content)

   ns = {"doc":"http://fms-standard.com/rfms/v1.0.0/xsd/position",
         "ns2":"http://fms-standard.com/rfms/v1.0.0/xsd/common/position"}

   data_list = [{'VIN': getvalueofnode(node.find("doc:VIN", ns)),
                 'CreatedDateTime': getvalueofnode(node.find('doc:CreatedDateTime', ns)),
                 'ReceivedDateTime': getvalueofnode(node.find('doc:ReceivedDateTime', ns)),
                 'Latitude': getvalueofnode(node.find('doc:GNSSPosition/ns2:Latitude', ns)),
                 'Longitude': getvalueofnode(node.find('doc:GNSSPosition/ns2:Longitude', ns)),
                 'Altitude': getvalueofnode(node.find('doc:GNSSPosition/ns2:Altitude', ns))} \
                      for node in root]

   df_xml = pd.DataFrame(data_list)

Output

print(df_xml)
#   Altitude      CreatedDateTime   Latitude  Longitude     ReceivedDateTime                VIN
# 0      655  2019-07-31T16:50:28  62.098339  10.542222  2019-07-31T16:50:29  YS2R8X40005440923
# 1     None  2019-07-31T18:13:24  62.127206  10.608676  2019-07-31T18:13:25  YS2R8X40005441367

Thanks for your suggestion @Parfait . Trying this out now and I get the df I wanted. However, was hoping to import the script to Power Bi, but it does not seem to find the df, and I can not refer to df_xml further down in the script either. Any suggestion how to make the df_xml variable count for the entire script? — JonasH, Aug 02 '19 at 08:47
Problem solved! Just had to put in the following line above the line defining df_xml: global df_xml . Thanks again @Parfait, you saved my day :) — JonasH, Aug 02 '19 at 10:06
Great to hear! Alternatively you can have the function `return df_xml` and at end assign to variable: `final_df = main2()`. — Parfait, Aug 02 '19 at 12:21

Transforming XML response into pandas dataframe

1 Answers1