Parsing specific field in XML file in Python

Question

I have an xml file that looks like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="http://data.treasury.gov:8001/Feed.svc/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">DailyTreasuryYieldCurveRateData</title>
  <id>http://data.treasury.gov:8001/feed.svc/DailyTreasuryYieldCurveRateData</id>
  <updated>2015-08-30T15:17:09Z</updated>
  <link rel="self" title="DailyTreasuryYieldCurveRateData" href="DailyTreasuryYieldCurveRateData" />
  <entry>
    <id>http://data.treasury.gov:8001/Feed.svc/DailyTreasuryYieldCurveRateData(6404)</id>
    <title type="text"></title>
    <updated>2015-08-30T15:17:09Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6404)" />
    <category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:Id m:type="Edm.Int32">6404</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-03T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:type="Edm.Double">0.03</d:BC_1MONTH>
        <d:BC_3MONTH m:type="Edm.Double">0.08</d:BC_3MONTH>
        <d:BC_6MONTH m:type="Edm.Double">0.17</d:BC_6MONTH>
        <d:BC_1YEAR m:type="Edm.Double">0.33</d:BC_1YEAR>
        <d:BC_2YEAR m:type="Edm.Double">0.68</d:BC_2YEAR>
        <d:BC_3YEAR m:type="Edm.Double">0.99</d:BC_3YEAR>
        <d:BC_5YEAR m:type="Edm.Double">1.52</d:BC_5YEAR>
        <d:BC_7YEAR m:type="Edm.Double">1.89</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">2.16</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">2.55</d:BC_20YEAR>
        <d:BC_30YEAR m:type="Edm.Double">2.86</d:BC_30YEAR>
        <d:BC_30YEARDISPLAY m:type="Edm.Double">2.86</d:BC_30YEARDISPLAY>
      </m:properties>
    </content>
  </entry>
  <entry>
    <id>http://data.treasury.gov:8001/Feed.svc/DailyTreasuryYieldCurveRateData(6405)</id>
    <title type="text"></title>
    <updated>2015-08-30T15:17:09Z</updated>
    <author>
      <name />
    </author>
    <link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6405)" />
    <category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
    <content type="application/xml">
      <m:properties>
        <d:Id m:type="Edm.Int32">6405</d:Id>
        <d:NEW_DATE m:type="Edm.DateTime">2015-08-04T00:00:00</d:NEW_DATE>
        <d:BC_1MONTH m:type="Edm.Double">0.05</d:BC_1MONTH>
        <d:BC_3MONTH m:type="Edm.Double">0.08</d:BC_3MONTH>
        <d:BC_6MONTH m:type="Edm.Double">0.18</d:BC_6MONTH>
        <d:BC_1YEAR m:type="Edm.Double">0.37</d:BC_1YEAR>
        <d:BC_2YEAR m:type="Edm.Double">0.74</d:BC_2YEAR>
        <d:BC_3YEAR m:type="Edm.Double">1.08</d:BC_3YEAR>
        <d:BC_5YEAR m:type="Edm.Double">1.6</d:BC_5YEAR>
        <d:BC_7YEAR m:type="Edm.Double">1.97</d:BC_7YEAR>
        <d:BC_10YEAR m:type="Edm.Double">2.23</d:BC_10YEAR>
        <d:BC_20YEAR m:type="Edm.Double">2.59</d:BC_20YEAR>
        <d:BC_30YEAR m:type="Edm.Double">2.9</d:BC_30YEAR>
        <d:BC_30YEARDISPLAY m:type="Edm.Double">2.9</d:BC_30YEARDISPLAY>
      </m:properties>
    </content>
  </entry>
</feed>

How can I parse out the '2.16' for 'BC_10YEAR'? I've been looking at other examples with ElementTree and lxml and I just can't seem to match up the xml format in those examples with that of my file.

The last thing I've tried was:

from lxml import etree
doc = etree.parse(yield_xml)
memoryElem = doc.find('content')
print memoryElem.text        # element text
print memoryElem.get('type') # attribute

I get an error: AttributeError: 'NoneType' object has no attribute 'text'

Is there a simple way to do this?

score 1 · Accepted Answer · answered Aug 31 '15 at 03:03

1

You may try built-in split method:

>>>[data.split('>')[1].split('<')[0] for data in str(xml_file).split('<d:') if 'BC_10YEAR' in data][0]
'2.16'

answered Aug 31 '15 at 03:03

mmachine

896
6
10

I tried 'with open('test.xml', 'rb') as xml_file: [data.split('>')[1].split('<')[0] for data in str(xml_file).split(' – Colin Aug 31 '15 at 04:50
It means you test.xml file object differs from example above. – mmachine Aug 31 '15 at 05:50
That's strange, I'm pretty sure my file has exactly what I pasted above. Anyway, I modified to this to get it to work:`with open(yield_xml, 'rb') as yield_file: for line in yield_file: if 'BC_10YEAR' in line: cur_yield = float(line.split('>')[1].split('<')[0]) break` – Colin Sep 01 '15 at 05:13

score 0 · Answer 2 · edited May 23 '17 at 10:26

I'd suggest to use lxml's xpath() method which provide better XPath expression support :

from lxml import etree

doc = etree.parse(yield_xml)

#register prefixes to be used in xpath
ns = {"foo": "http://www.w3.org/2005/Atom",
      "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
      "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"}

#select element <d:BC_10YEAR>, and convert the value to number
result = doc.xpath("number(//foo:content/m:properties/d:BC_10YEAR)", namespaces=ns)

#print the result
print(result)
print(type(result))

output :

2.16
<type 'float'>

In case you wonder why foo:content instead of just foo in the xpath expression above, that's because content inherits default namespace from the root element, implicitly. And the default namespace uri is mapped to prefix foo in the above code; related question : parsing xml containing default namespace to get an element value using lxml

Thanks the code works. Unfortunately my knowledge of xml is very limited so I could not understand much of what you said. I do have a question though: how does the code differentiate between the two 'BC_10YEAR' values in the xml file? The first one is 2.16 but there is another one that's 2.23. — Colin, Sep 01 '15 at 05:21
The code will return the first only. Getting the the other one, or all `BC_10YEAR` is perfectly possible with a little change in the xpath argument — har07, Sep 01 '15 at 05:28

Parsing specific field in XML file in Python

2 Answers2