I'm having trouble iterating over a big XML file (~300 MB) and summing its values into a Python dictionary. I quickly realized that it's not lxml etree's iterparse that is slowing things down, but the dictionary access on each iteration.
The following is a snippet from my XML file:
<timestep time="7.00">
<vehicle id="1" eclass="HBEFA3/PC_G_EU4" CO2="0.00" CO="0.00" HC="0.00" NOx="0.00" PMx="0.00" fuel="0.00" electricity="0.00" noise="54.33" route="!1" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-27444291_0" pos="26.79" speed="4.71" angle="54.94" x="3613.28" y="1567.25"/>
<vehicle id="2" eclass="HBEFA3/PC_G_EU4" CO2="3860.00" CO="133.73" HC="0.70" NOx="1.69" PMx="0.08" fuel="1.66" electricity="0.00" noise="65.04" route="!2" type="DEFAULT_VEHTYPE" waiting="0.00" lane=":1785290_3_0" pos="5.21" speed="3.48" angle="28.12" x="789.78" y="2467.09"/>
</timestep>
<timestep time="8.00">
<vehicle id="1" eclass="HBEFA3/PC_G_EU4" CO2="0.00" CO="0.00" HC="0.00" NOx="0.00" PMx="0.00" fuel="0.00" electricity="0.00" noise="58.15" route="!1" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-27444291_0" pos="31.50" speed="4.71" angle="54.94" x="3617.14" y="1569.96"/>
<vehicle id="2" eclass="HBEFA3/PC_G_EU4" CO2="5431.06" CO="135.41" HC="0.75" NOx="2.37" PMx="0.11" fuel="2.33" electricity="0.00" noise="68.01" route="!2" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-412954611_0" pos="1.38" speed="5.70" angle="83.24" x="795.26" y="2467.99"/>
<vehicle id="3" eclass="HBEFA3/PC_G_EU4" CO2="2624.72" CO="164.78" HC="0.81" NOx="1.20" PMx="0.07" fuel="1.13" electricity="0.00" noise="55.94" route="!3" type="DEFAULT_VEHTYPE" waiting="0.00" lane="22338220_0" pos="5.10" speed="0.00" angle="191.85" x="2315.21" y="2613.18"/>
</timestep>
Each timestep has a growing number of vehicles in it. There are around 11800 timesteps in this file.
Now I want to sum up the values for all vehicles based on their location. There are x, y values provided which I can convert to lat, long.
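(For context: net in the code below is a SUMO road network loaded with sumolib, roughly like this; the network file name is just a placeholder and sumolib is assumed to be importable.)

import sumolib

# SUMO network used to project the simulation's x/y coordinates to lon/lat
net = sumolib.net.readNet("/path/to/my.net.xml")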
My current approach is to iterate over the file with lxml etree's iterparse and sum up the values, using the lat,long pair as the dict key.
I'm using the fast_iter pattern from this article: https://www.ibm.com/developerworks/xml/library/x-hiperfparse/
from lxml import etree

raw_pollution_data = {}  # "lat,lng" string -> summed emission values
def fast_iter(context, func):
    # Process each element, then free it (and any already-processed
    # preceding siblings) so the tree does not grow while iterating.
    for _, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
def aggregate(vehicle):
    veh_id = int(vehicle.attrib["id"])
    veh_co2 = float(vehicle.attrib["CO2"])
    veh_co = float(vehicle.attrib["CO"])
    veh_nox = float(vehicle.attrib["NOx"])
    veh_pmx = float(vehicle.attrib["PMx"])  # mg/s
    lng, lat = net.convertXY2LonLat(float(vehicle.attrib["x"]), float(vehicle.attrib["y"]))
    coordinate = str(round(lat, 4)) + "," + str(round(lng, 4))
    if coordinate in raw_pollution_data:
        raw_pollution_data[coordinate]["CO2"] += veh_co2
        raw_pollution_data[coordinate]["NOX"] += veh_nox
        raw_pollution_data[coordinate]["PMX"] += veh_pmx
        raw_pollution_data[coordinate]["CO"] += veh_co
    else:
        raw_pollution_data[coordinate] = {}
        raw_pollution_data[coordinate]["CO2"] = veh_co2
        raw_pollution_data[coordinate]["NOX"] = veh_nox
        raw_pollution_data[coordinate]["PMX"] = veh_pmx
        raw_pollution_data[coordinate]["CO"] = veh_co
def parse_emissions():
    xml_file = "/path/to/emission_output.xml"
    context = etree.iterparse(xml_file, tag="vehicle")
    fast_iter(context, aggregate)
    print(raw_pollution_data)

parse_emissions()
However, this approach takes around 25 minutes to parse the whole file, and I'm not sure how to do it differently. I know the global variable is awful, but I thought it would make the code cleaner?
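For reference, this is roughly how I'm measuring the timings (a separate run for each case; the path is the same placeholder as above):

import time

context = etree.iterparse("/path/to/emission_output.xml", tag="vehicle")
start = time.perf_counter()
fast_iter(context, aggregate)               # ~25 min with aggregation
# fast_iter(context, lambda elem: None)     # ~25 s when only iterating
print(f"fast_iter took {time.perf_counter() - start:.1f} s")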
Can you think of something else? I know the dictionary access is the culprit: without the aggregate function, fast_iter takes around 25 seconds.
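One variation I've been considering is collapsing the if/else with collections.defaultdict so each coordinate entry is created on first access, roughly as sketched below (untested, same net and key format as above), but I don't know whether it would change the timing much:

from collections import defaultdict

# Each coordinate gets a zeroed nested dict on first access,
# so the explicit membership check in aggregate() goes away.
raw_pollution_data = defaultdict(lambda: {"CO2": 0.0, "NOX": 0.0, "PMX": 0.0, "CO": 0.0})

def aggregate(vehicle):
    attrib = vehicle.attrib
    lng, lat = net.convertXY2LonLat(float(attrib["x"]), float(attrib["y"]))
    coordinate = str(round(lat, 4)) + "," + str(round(lng, 4))
    entry = raw_pollution_data[coordinate]
    entry["CO2"] += float(attrib["CO2"])
    entry["NOX"] += float(attrib["NOx"])
    entry["PMX"] += float(attrib["PMx"])
    entry["CO"] += float(attrib["CO"])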