
I'm having problems iterating over a big XML file (~300MB) and summing up its values into a Python dictionary. I quickly realized that it's not lxml etree's iterparse that is slowing things down, but the dictionary access on each iteration.

The following is a code snippet from my XML file:

    <timestep time="7.00">
        <vehicle id="1" eclass="HBEFA3/PC_G_EU4" CO2="0.00" CO="0.00" HC="0.00" NOx="0.00" PMx="0.00" fuel="0.00" electricity="0.00" noise="54.33" route="!1" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-27444291_0" pos="26.79" speed="4.71" angle="54.94" x="3613.28" y="1567.25"/>
        <vehicle id="2" eclass="HBEFA3/PC_G_EU4" CO2="3860.00" CO="133.73" HC="0.70" NOx="1.69" PMx="0.08" fuel="1.66" electricity="0.00" noise="65.04" route="!2" type="DEFAULT_VEHTYPE" waiting="0.00" lane=":1785290_3_0" pos="5.21" speed="3.48" angle="28.12" x="789.78" y="2467.09"/>
    </timestep>
    <timestep time="8.00">
        <vehicle id="1" eclass="HBEFA3/PC_G_EU4" CO2="0.00" CO="0.00" HC="0.00" NOx="0.00" PMx="0.00" fuel="0.00" electricity="0.00" noise="58.15" route="!1" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-27444291_0" pos="31.50" speed="4.71" angle="54.94" x="3617.14" y="1569.96"/>
        <vehicle id="2" eclass="HBEFA3/PC_G_EU4" CO2="5431.06" CO="135.41" HC="0.75" NOx="2.37" PMx="0.11" fuel="2.33" electricity="0.00" noise="68.01" route="!2" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-412954611_0" pos="1.38" speed="5.70" angle="83.24" x="795.26" y="2467.99"/>
        <vehicle id="3" eclass="HBEFA3/PC_G_EU4" CO2="2624.72" CO="164.78" HC="0.81" NOx="1.20" PMx="0.07" fuel="1.13" electricity="0.00" noise="55.94" route="!3" type="DEFAULT_VEHTYPE" waiting="0.00" lane="22338220_0" pos="5.10" speed="0.00" angle="191.85" x="2315.21" y="2613.18"/>
    </timestep>

Each timestep has a growing number of vehicles in it. There are around 11800 timesteps in this file.

Now I want to sum up the values for all vehicles based on their location. There are x, y values provided which I can convert to lat, long.

My current approach is to iterate over the file with lxml etree's iterparse and sum up the values using the lat,long pair as the dict key.

I'm using fast_iter from this article https://www.ibm.com/developerworks/xml/library/x-hiperfparse/

    from lxml import etree

    # net is a sumolib network object created elsewhere, e.g.
    # net = sumolib.net.readNet("/path/to/network.net.xml")

    raw_pollution_data = {}

    def fast_iter(context, func):
        for _, elem in context:
            func(elem)
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    def aggregate(vehicle):
        veh_id = int(vehicle.attrib["id"])
        veh_co2 = float(vehicle.attrib["CO2"])
        veh_co = float(vehicle.attrib["CO"])
        veh_nox = float(vehicle.attrib["NOx"])
        veh_pmx = float(vehicle.attrib["PMx"])  # mg/s
        lng, lat = net.convertXY2LonLat(float(vehicle.attrib["x"]), float(vehicle.attrib["y"]))

        coordinate = str(round(lat, 4)) + "," + str(round(lng, 4))

        if coordinate in raw_pollution_data:
            raw_pollution_data[coordinate]["CO2"] += veh_co2
            raw_pollution_data[coordinate]["NOX"] += veh_nox
            raw_pollution_data[coordinate]["PMX"] += veh_pmx
            raw_pollution_data[coordinate]["CO"] += veh_co
        else:
            raw_pollution_data[coordinate] = {}
            raw_pollution_data[coordinate]["CO2"] = veh_co2
            raw_pollution_data[coordinate]["NOX"] = veh_nox
            raw_pollution_data[coordinate]["PMX"] = veh_pmx
            raw_pollution_data[coordinate]["CO"] = veh_co

    def parse_emissions():
        xml_file = "/path/to/emission_output.xml"
        context = etree.iterparse(xml_file, tag="vehicle")
        fast_iter(context, aggregate)
        print(raw_pollution_data)

    parse_emissions()

However, this approach takes around 25 minutes to parse the whole file. I'm not sure how to do it differently. I know the global variable is awful, but I thought it would make things cleaner?

Can you think of something else? I know the slowdown comes from the dictionary. Without the aggregate function, fast_iter takes around 25 seconds.

  • I can imagine that, with [use-multiprocessing-queue-in-python](https://stackoverflow.com/questions/11515944/how-to-use-multiprocessing-queue-in-python), you could reduce the overall time. – stovfl Oct 12 '19 at 12:10
  • You are doing a lot of needlessly verbose work. How fast can you process the file when you use `fast_iter(context, lambda node: None)`? – Martijn Pieters Oct 12 '19 at 12:56
  • When you convert the Sumo `(x, y)` coordinate to a latitude and longitude and round the values, is that aimed at aggregating vehicles in the same geographic area? Can this same aggregation be done with the `x` and `y` values themselves? – Martijn Pieters Oct 12 '19 at 13:06
  • @MartijnPieters What do you mean by needlessly verbose work? Just parsing the file without doing anything takes ~25 seconds. – scheda74 Oct 12 '19 at 13:39
  • @MartijnPieters I tested it without converting x, y values to lat, long and it's way faster but that's not really the point. I want to aggregate some of those locations. So converting the x, y values gives me a lat with around 10 decimal places. That's why I'm rounding them to summarize some of those points to one. – scheda74 Oct 12 '19 at 13:41
  • @scheda74: exactly, but so would using `x = int(float(vehicle.attrib["x"]))` (and using the same for `y`). You don’t have to use long/lat coordinates just to aggregate by area. – Martijn Pieters Oct 12 '19 at 13:43
  • Right, took a sec as I didn't have sumo installed and creating a minimal example took a little doing. I've updated my answer to use pandas, this should be *way faster* as we can then do all the work in aggregate. – Martijn Pieters Oct 12 '19 at 16:05
  • 11800 timesteps is not that many, that's less than 10MB of XML data. Is there a lot of other data in that XML document that's simply not timestep data? – Martijn Pieters Oct 12 '19 at 16:16

1 Answer


Your code is slow for 2 reasons:

  • You do unnecessary work and use inefficient Python statements. You don't use veh_id, but you still call int() to convert it. You create an empty dictionary only to then set 4 keys on it in separate statements; you use separate str() and round() calls together with string concatenation where string formatting could do all that work in one step; and you reference .attrib repeatedly, so Python has to find that dictionary attribute for you again and again.

  • The sumolib.net.convertXY2LonLat() implementation is very inefficient when used for each and every individual (x, y) coordinate; it loads the offset and pyproj.Proj() object from scratch each time. We can cut out repeated operations here by caching the pyproj.Proj() instance, for example. Or we could avoid using it, or use it just once by processing all coordinates in a single step.

The first issue can mostly be avoided by removing unnecessary work, by caching things like the attributes dictionary and using it just once, and by caching repeated global name lookups in function arguments (local names are faster to use); the _... keyword arguments are there purely to avoid looking up globals:

    from operator import itemgetter

    _fields = ('CO2', 'CO', 'NOx', 'PMx')

    def aggregate(
        vehicle,
        _fields=_fields,
        _get=itemgetter(*_fields, 'x', 'y'),
        _conv=net.convertXY2LonLat,
    ):
        # convert all the fields we need to floats in one step
        *values, x, y = map(float, _get(vehicle.attrib))
        # convert the coordinates to latitude and longitude
        lng, lat = _conv(x, y)
        # get the aggregation dictionary (start with an empty one if missing)
        data = raw_pollution_data.setdefault(
            f"{lng:.4f},{lat:.4f}",
            dict.fromkeys(_fields, 0.0)
        )
        # and sum the numbers
        for f, v in zip(_fields, values):
            data[f] += v

To address the second issue we can replace the location lookup with something that at least re-uses the Proj() instance; we need to apply the location offset manually in that case:

    proj = net.getGeoProj()
    offset = net.getLocationOffset()
    adjust = lambda x, y, _dx=offset[0], _dy=offset[1]: (x - _dx, y - _dy)

    def longlat(x, y, _proj=proj, _adjust=adjust):
        return _proj(*_adjust(x, y), inverse=True)

then use this in the aggregation function by replacing the _conv local name:

    def aggregate(
        vehicle,
        _fields=_fields,
        _get=itemgetter(*_fields, 'x', 'y'),
        _conv=longlat,
    ):
        # function body stays the same
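A quick way to check that the helper matches sumolib's own conversion is to compare the two for one sample point (a sketch; any (x, y) pair from your data will do):

    from math import isclose

    x, y = 3613.28, 1567.25  # taken from the first sample vehicle
    lng1, lat1 = net.convertXY2LonLat(x, y)
    lng2, lat2 = longlat(x, y)
    assert isclose(lng1, lng2) and isclose(lat1, lat2)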

This is still going to be slow, because it requires that we convert each (x, y) pair separately.

It depends on the exact projection used, but you could simply quantise the x and y coordinates themselves to do the grouping. You’d apply the offset first, then “round” the coordinates down by the same amount that the conversion plus rounding would otherwise achieve. Projecting (1, 0) and (0, 0) and taking the difference in longitude tells us the rough conversion rate the projection uses, and dividing that by 10000 gives us the size of your aggregation area in terms of x and y values:

    (proj(1, 0)[0] - proj(0, 0)[0]) / 10000

For a standard UTM projection this gives me about 11.5, so floor-dividing the x and y coordinates by that factor and then multiplying again should give you roughly the same amount of grouping without having to do a full coordinate conversion for every timestep data point:

    proj = net.getGeoProj()
    factor = abs(proj(1, 0)[0] - proj(0, 0)[0]) / 10000
    dx, dy = net.getLocationOffset()

    def quantise(v, _f=factor):
        # snap v down to the start of its aggregation bin
        return v // _f * _f

    def aggregate(
        vehicle,
        _fields=_fields,
        _get=itemgetter(*_fields, 'x', 'y'),
        _dx=dx, _dy=dy,
        _quant=quantise,
    ):
        *values, x, y = map(float, _get(vehicle.attrib))
        key = _quant(x - _dx), _quant(y - _dy)
        data = raw_pollution_data.setdefault(key, dict.fromkeys(_fields, 0.0))
        for f, v in zip(_fields, values):
            data[f] += v

For the very limited dataset shared in the question this gives me the same results.

However, it could be that this leads to distorted results at different points on the map, if the projection varies across longitudes. I also don’t know how exactly you needed to aggregate vehicle coordinates across an area.
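Should you later need latitude and longitude for the aggregated areas (e.g. for display), you can still convert the quantised keys afterwards, once per area rather than once per vehicle row. A minimal sketch, reusing the longlat() helper and the dx, dy offsets defined above:

    # convert the (far fewer) aggregated keys back to lng/lat in a final pass;
    # longlat() subtracts the offset again, so add it back in first
    pollution_by_lnglat = {
        longlat(x + dx, y + dy): data
        for (x, y), data in raw_pollution_data.items()
    }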

If you really can only aggregate by areas of 1/10000th of a degree of longitude and latitude, then you can make the conversion from (x, y) pairs to lng/lat pairs a lot faster by feeding whole numpy arrays to net.convertXY2LonLat(). That's because pyproj.Proj() accepts arrays and converts coordinates in bulk, so instead of making hundreds of thousands of separate conversion calls we’d make just a single one.

Rather than handling this with Python dictionaries and float objects, you should really use a Pandas DataFrame here. It can trivially take in the strings from each element's attribute dictionary (using an operator.itemgetter() object with all the desired keys gives you those values very fast) and turn all those string values into floating point numbers as it ingests the data. The values are stored in compact binary form in contiguous memory, so 11800 rows of coordinates and data entries won't take much memory.

So, load your data into a DataFrame first, then from that object convert your (x, y) coordinates in a single step, and only then aggregate values by area by using the Pandas grouping functionality:

    from lxml import etree
    import pandas as pd
    import numpy as np

    from operator import itemgetter

    def extract_attributes(context, fields):
        values = itemgetter(*fields)
        for _, elem in context:
            yield values(elem.attrib)
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    def parse_emissions(filename):
        context = etree.iterparse(filename, tag="vehicle")

        # create a dataframe from the XML data in a single call
        coords = ['x', 'y']
        entries = ['CO2', 'CO', 'NOx', 'PMx']
        df = pd.DataFrame(
            extract_attributes(context, coords + entries),
            columns=coords + entries, dtype=np.float64)

        # convert *all coordinates together*, then remove the x, y columns;
        # note that the net.convertXY2LonLat() call *alters the numpy
        # arrays in-place*, so we don't want to keep them anyway.
        df['lng'], df['lat'] = net.convertXY2LonLat(df.x.to_numpy(), df.y.to_numpy())
        df.drop(coords, axis=1, inplace=True)

        # 'group' the data by rounding the latitude and longitude,
        # effectively creating areas of 1/10000th of a degree per side
        lnglat = ['lng', 'lat']
        df[lnglat] = df[lnglat].round(4)

        # aggregate the results and return the summed dataframe
        return df.groupby(lnglat)[entries].sum()

    emissions = parse_emissions("/path/to/emission_output.xml")
    print(emissions)

Using Pandas, a sample sumo net definition file, and a reconstructed XML file made by repeating your 2 sample timestep entries 5900 times, I can parse the whole data set in about 1 second, total time. However, I suspect that your 11800 timesteps number is way too low (as that’s less than 10MB of XML data), so I also wrote out 11800 * 20 == 236000 copies of your sample to a file, and that took 22 seconds to process with Pandas.
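Generating such a test file is nothing fancy; a rough sketch (the root element name and output path here are placeholders, adjust them to match your real emission output):

    # write the two sample timesteps from the question out many times
    # to get a realistically sized test file
    sample = """  <timestep time="7.00">
        <vehicle id="1" CO2="0.00" CO="0.00" NOx="0.00" PMx="0.00" x="3613.28" y="1567.25"/>
        <vehicle id="2" CO2="3860.00" CO="133.73" NOx="1.69" PMx="0.08" x="789.78" y="2467.09"/>
      </timestep>
    """

    with open("emission_output_test.xml", "w") as out:
        out.write("<emission-export>\n")
        for _ in range(236000):
            out.write(sample)
        out.write("</emission-export>\n")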

You could also look at GeoPandas, which would let you aggregate by geographical areas.
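A minimal sketch of that direction (assuming geopandas and shapely are installed; EPSG:4326 is the standard code for plain lng/lat coordinates):

    import geopandas as gpd
    from shapely.geometry import Point

    # turn the aggregated (lng, lat) index into point geometries, so the sums
    # can be spatially joined against or dissolved into real geographic areas
    agg = emissions.reset_index()
    gdf = gpd.GeoDataFrame(
        agg,
        geometry=[Point(lng, lat) for lng, lat in zip(agg.lng, agg.lat)],
        crs="EPSG:4326",
    )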

  • You Sir, deserve a medal! Works perfectly and the whole thing takes around 20 seconds. The file contains 11800 timesteps, but each timestep has a growing number of vehicles in it, so it's roughly 11800 * 100 entries to parse. In the end I aggregated 18530 rows. Thanks!!! – scheda74 Oct 13 '19 at 10:49