0

I have the following data structure (original is 2.5gb, therefore parsing mandatory):

<households xmlns="http://www.matsim.org/files/dtd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.matsim.org/files/dtd http://www.matsim.org/files/dtd/households_v1.0.xsd">
    <household id="1473">
        <members>
            <personId refId="2714"/>
            <personId refId="2715"/>
            <personId refId="2716"/>
            <personId refId="2717"/>
            <personId refId="2718"/>
            <personId refId="2719"/>
        </members>
        <income currency="CHF" period="month">
                3094.87101
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >some</attribute>
            <attribute name="carAvailability" class="java.lang.String" >some</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >3.3</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >3094.8710104279835</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >1</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >10213.074334412346</attribute>
        </attributes>

    </household>
    <household id="2474">
        <members>
            <personId refId="4647"/>
            <personId refId="4648"/>
            <personId refId="4649"/>
            <personId refId="4650"/>
            <personId refId="4651"/>
            <personId refId="4652"/>
            <personId refId="4653"/>
            <personId refId="4654"/>
            <personId refId="4655"/>
        </members>
        <income currency="CHF" period="month">
                1602.562822
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >none</attribute>
            <attribute name="carAvailability" class="java.lang.String" >all</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >3.6999999999999997</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >1602.5628215679633</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >1</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >5929.482439801463</attribute>
        </attributes>

    </household>
    <household id="4024">
        <members>
            <personId refId="7685"/>
        </members>
        <income currency="CHF" period="month">
                61610.096619
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >none</attribute>
            <attribute name="carAvailability" class="java.lang.String" >none</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >1.0</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >61610.096618936936</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >0</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >61610.096618936936</attribute>
        </attributes>

    </household>
</households>

I want to extract all the person ID refId Values and their corresponding income Value. Eventually, I plan to have a df with one column of the personId and one column with the income (income will be repetitive). So the tricky part is not only the namespace, but also how to access XML at different node levels.

My approach so far fails to do that.

import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd
import numpy as np

tree = ET.parse(gzip.open('V0_1pm/output_households.xml.gz', 'r'))
root = tree.getroot()
rows = []
for it in root.iter('household'):
    hh = it.attrib['id']
    inc = it.find('income').text
    rows.append([hh,inc])

hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
hh_inc

Any help is highly appreciated.

dspencer
  • 4,297
  • 4
  • 22
  • 43
Yves
  • 536
  • 4
  • 15
  • "therefore parsing mandatory" What do you mean by that? – mzjn Jan 18 '20 at 15:10
  • 1
    How does your approach fail? – larsks Jan 18 '20 at 15:14
  • I'll second what the other comments asked. There really isn't much to work with here. – AMC Jan 18 '20 at 15:35
  • 1
    Your xml has a namespace - [Parsing XML with Namespaces](https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces). `root.iter('{http://www.matsim.org/files/dtd}household')` – wwii Jan 18 '20 at 15:37
  • @mzjn i meant with it that my computer does not have the computational power to extract the information without this method. – Yves Jan 18 '20 at 15:50
  • @larsks I'm not able to get any output, except for an empty df – Yves Jan 18 '20 at 15:51
  • What is "this method"? In the question, you use `ET.parse()` and after that `root.iter()`. – mzjn Jan 18 '20 at 15:57

1 Answers1

1

The reason that your code failed is that your input elements have non-empty namespace.

One of methods to process namespaced XML is to:

  • Define a dictionary "shortcut: namespace", containing all namespaces used in your XPath expressions.
  • Call findall or find, passing this dictionary as the second parameter and prepend relevant namespace shortcuts (and a colon as a separator) in XPath expressions.

Note also that find(...).text returns full text, with newline chars and spaces. To cope with this, you probably should:

  • Strip the content read from "surrounding" white chars.
  • Convert it to float.

So change your code to:

# Namespace dictionary
ns = {'dtd': 'http://www.matsim.org/files/dtd'}
rows = []
for it in root.findall('dtd:household', ns):
    hh = it.attrib['id']
    inc = it.find('dtd:income', ns).text
    inc = float(inc.strip())
    rows.append([hh, inc])
hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
hh_inc

For your sample input I got:

     id  PTSubscription
0  1473     3094.871010
1  2474     1602.562822
2  4024    61610.096619

Edit following the question about refId

I assume that the DataFrame should contain separate row for each refId, with relevant id, and PTSubscription.

To include refId, change the loop to:

for it in root.findall('dtd:household', ns):
    hh = it.attrib['id']
    inc = it.find('dtd:income', ns).text
    inc = float(inc.strip())
    pids = it.findall('.//dtd:personId', ns)
    for pId in pids:
        refId = pId.attrib['refId']
        rows.append([hh, inc, int(refId)])
    if not pids:
        rows.append([hh, inc, -1])

I added last 2 instructions in order not to "loose" any household containing no refId.

When creating the DataFrame, pass the additional column name:

hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription', 'refId'])
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • Thank you very much for your work! it does work. Unfortunately, in the code I've started I used the household ID only. What I actually wanted is the ```refId``` of the persons living in the household. I am, however, not sure how to adjust this code, as it is inside another node than the income. – Yves Jan 18 '20 at 16:10