parse large xml in python

Question

I have a very large XML file (about 100 MB) containing many elements similar to the one in this example:

<adrmsg:hasMember>
    <aixm:DesignatedPoint gml:id="ID_197095_1650420151927_74256">
        <gml:identifier codeSpace="urn:uuid:">084e1bb6-94f7-450f-a88e-44eb465cd5a6</gml:identifier>
        <aixm:timeSlice>
            <aixm:DesignatedPointTimeSlice gml:id="ID_197095_1650420151927_74257">
                <gml:validTime>
                    <gml:TimePeriod gml:id="ID_197095_1650420151927_74258">
                        <gml:beginPosition>2020-12-31T00:00:00</gml:beginPosition>
                        <gml:endPosition indeterminatePosition="unknown"/>
                    </gml:TimePeriod>
                </gml:validTime>
                <aixm:interpretation>BASELINE</aixm:interpretation>
                <aixm:featureLifetime>
                    <gml:TimePeriod gml:id="ID_197095_1650420151927_74259">
                        <gml:beginPosition>2020-12-31T00:00:00</gml:beginPosition>
                        <gml:endPosition indeterminatePosition="unknown"/>
                    </gml:TimePeriod>
                </aixm:featureLifetime>
                <aixm:designator>BITLA</aixm:designator>
                <aixm:type>ICAO</aixm:type>
                <aixm:location>
                    <aixm:Point gml:id="ID_197095_1650420151927_74260">
                        <gml:pos srsName="urn:ogc:def:crs:EPSG::4326">40.87555555555556 21.358055555555556</gml:pos>
                    </aixm:Point>
                </aixm:location>
                <aixm:extension>
                    <adrext:DesignatedPointExtension gml:id="ID_197095_1650420151927_74261">
                        <adrext:pointUsage>
                            <adrext:PointUsage gml:id="ID_197095_1650420151927_74262">
                                <adrext:role>FRA_ENTRY</adrext:role>
                                <adrext:reference_border>
                                    <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74263">
                                        <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                                        <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>
                                    </adrext:AirspaceBorderCrossingObject>
                                </adrext:reference_border>
                            </adrext:PointUsage>
                        </adrext:pointUsage>
                        <adrext:pointUsage>
                            <adrext:PointUsage gml:id="ID_197095_1650420151927_74264">
                                <adrext:role>FRA_EXIT</adrext:role>
                                <adrext:reference_border>
                                    <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74265">
                                        <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                                        <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>
                                    </adrext:AirspaceBorderCrossingObject>
                                </adrext:reference_border>
                            </adrext:PointUsage>
                        </adrext:pointUsage>
                    </adrext:DesignatedPointExtension>
                </aixm:extension>
            </aixm:DesignatedPointTimeSlice>
        </aixm:timeSlice>
    </aixm:DesignatedPoint>
</adrmsg:hasMember>

The ultimate goal is to load the parsed data from this very large XML file into a pandas DataFrame.

So far I cannot capture the data I am looking for. I only manage to capture the data from the very last element in the file.

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

ab = {'aixm':'http://www.aixm.aero/schema/5.1.1', 'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR', 'gml':'http://www.opengis.net/gml/3.2'}
for point in root.findall('.//aixm:DesignatedPointTimeSlice', ab):
    designator = point.find('.//aixm:designator', ab)
    d = point.find('.//{http://www.aixm.aero/schema/5.1.1}type', ab)
for pos in point.findall('.//gml:pos', ab):
    print(designator.text, pos.text, d.text)

The print statement returns the data I would like to have but, as mentioned, only for the very last element of the file, whereas I would like the result for all of them:

ZIFSA 54.02111111111111 27.823888888888888 ICAO

Could I please be advised on the path I should follow? Any help would be appreciated. Thank you very much.

I am stuck there and don't know how to proceed. I've tried, and all I get from the initial df is a single DesignatedPoint column of NaN values: 0 NaN, 1 NaN, 2 NaN, 3 NaN, 4 NaN, ..., 135038 NaN, 135039 NaN, 135040 NaN, 135041 NaN, 135042 NaN [135043 rows x 1 columns] - setan0, Jun 11 '22 at 09:42
df = pd.read_xml('file.xml', xpath=".//gml:pos", namespaces={"gml":"http://www.opengis.net/gml/3.2"}) This does the trick for one of the XPath expressions I am looking for, with its corresponding namespace. Now I may need additional guidance to reach the ultimate goal. Do you know how to add multiple XPath expressions and namespaces there? - setan0, Jun 11 '22 at 09:56
Please post the root of the XML and wherever else namespaces are defined. As of now, your posted snippet is not well-formed XML. - Parfait, Jun 12 '22 at 01:06

Parfait · Accepted Answer · 2022-06-12T16:57:56.143

Assuming all three needed nodes (aixm:designator, aixm:type, and gml:pos) are always present, consider parsing the parent nodes, aixm:DesignatedPointTimeSlice and aixm:Point, into separate DataFrames and then joining them by row position. Finally, select the three columns needed.

import pandas as pd

ab = {
    'aixm':'http://www.aixm.aero/schema/5.1.1', 
    'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR',
    'gml':'http://www.opengis.net/gml/3.2'
}

time_slice_df = pd.read_xml(
    'file.xml', xpath=".//aixm:DesignatedPointTimeSlice", namespaces=ab
).add_prefix("time_slice_")

point_df = pd.read_xml(
    'file.xml', xpath=".//aixm:Point", namespaces=ab
).add_prefix("point_")

time_slice_df = (
    time_slice_df.join(point_df)
    .reindex(
        ["time_slice_designator", "time_slice_type", "point_pos"], 
        axis="columns"
    )
)
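
As for why the loop in the question prints only one line: the "for pos" loop sits outside the "for point" loop, so it runs once, after point, designator, and d already refer to the last element. A minimal sketch of the standard-library route, collecting one row per aixm:DesignatedPointTimeSlice inside the loop, might look like the following. The <root> wrapper and the extract_rows helper are made up for illustration, since the real root element was not posted; the two sample points reuse values shown in the question.

```python
import xml.etree.ElementTree as ET

# Namespace map copied from the question (only the prefixes used here)
NS = {
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "gml": "http://www.opengis.net/gml/3.2",
}

# Tiny stand-in document; the <root> wrapper is hypothetical
SAMPLE = """<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>BITLA</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location>
      <aixm:Point>
        <gml:pos>40.87555555555556 21.358055555555556</gml:pos>
      </aixm:Point>
    </aixm:location>
  </aixm:DesignatedPointTimeSlice>
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>ZIFSA</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location>
      <aixm:Point>
        <gml:pos>54.02111111111111 27.823888888888888</gml:pos>
      </aixm:Point>
    </aixm:location>
  </aixm:DesignatedPointTimeSlice>
</root>"""


def extract_rows(root):
    """Return one dict per DesignatedPointTimeSlice (hypothetical helper)."""
    rows = []
    for ts in root.findall(".//aixm:DesignatedPointTimeSlice", NS):
        # Everything is read inside the loop, so no element is skipped
        rows.append({
            "designator": ts.findtext("aixm:designator", namespaces=NS),
            "type": ts.findtext("aixm:type", namespaces=NS),
            "pos": ts.findtext(".//gml:pos", namespaces=NS),
        })
    return rows


rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row["designator"], row["pos"], row["type"])
# BITLA 40.87555555555556 21.358055555555556 ICAO
# ZIFSA 54.02111111111111 27.823888888888888 ICAO
```

With the real file, ET.parse('file.xml').getroot() would take the place of ET.fromstring(SAMPLE), and the list of dicts can be passed to pd.DataFrame(rows).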

In the forthcoming pandas 1.5, read_xml will also support an iterparse argument, which allows retrieval of descendant nodes without being limited to XPath expressions:

time_slice_df = pd.read_xml(
    'file.xml', 
    namespaces = ab, 
    iterparse = {"aixm:DesignatedPointTimeSlice": 
        ["aixm:designator", "aixm:type", "aixm:Point"]
    }
)
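
If the installed pandas version lacks iterparse, or memory is a concern with a file of about 100 MB, the standard library's xml.etree.ElementTree.iterparse can stream the document instead of building the whole tree. The following is only a sketch under stated assumptions: the <root> wrapper and the iter_points helper are hypothetical, and an in-memory sample stands in for file.xml.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URIs from the question, in ElementTree's {uri}localname form
AIXM = "{http://www.aixm.aero/schema/5.1.1}"
GML = "{http://www.opengis.net/gml/3.2}"

# In-memory stand-in for file.xml; the <root> wrapper is hypothetical
SAMPLE = b"""<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPointTimeSlice>
    <aixm:designator>BITLA</aixm:designator>
    <aixm:type>ICAO</aixm:type>
    <aixm:location>
      <aixm:Point>
        <gml:pos>40.87555555555556 21.358055555555556</gml:pos>
      </aixm:Point>
    </aixm:location>
  </aixm:DesignatedPointTimeSlice>
</root>"""


def iter_points(source):
    """Yield one dict per DesignatedPointTimeSlice while streaming the input."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # An "end" event fires once the element and all its children are read
        if elem.tag == AIXM + "DesignatedPointTimeSlice":
            yield {
                "designator": elem.findtext(AIXM + "designator"),
                "type": elem.findtext(AIXM + "type"),
                "pos": elem.findtext(".//" + GML + "pos"),
            }
            # Release this element's children; the emptied element itself
            # stays attached to its parent
            elem.clear()


points = list(iter_points(io.BytesIO(SAMPLE)))
print(points)
```

For the real data, iter_points('file.xml') accepts the file path directly, and pd.DataFrame(list(iter_points('file.xml'))) would produce the three columns.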
