0

I just started to use (=learn) Python 2.7. My current concern is focused on extracting information from XML files. So far xml.etree.ElementTree got me quite far. I am now stuck with an "KeyError". The reason - as far as I understand it - are elements with differing attributes.

Crucial part of the (much bigger) XML file:

<?xml version='1.0' encoding='utf-8' ?>

<XMLFILE>
  <datasources>
    <datasource caption='Sheet1 (ExcelSample)'>
      <connection class='excel-direct' filename='~\SomeExcel.xlsx' .....>
        ......
      </connection>
      <column header='Unit Price' datatype='real' name='[Calculation_1]'     role='measure' type='quantitative'>
        <calculation class='calculation' formula='Sum(Profit)/Sum(Sales)' />
      </column>
      <column datatype='integer' name='[Sales]' role='measure' type='quantitative' user:auto-column='numrec'>
        <calculation class='trial' formula='1' />
      </column>
    </datasource>
  </datasources>
  ........
</XMLFILE>

My Python code works fine for extracting datatype and name, i.e. attributes that exist in both columns:

for cal in xmlfile.findall('datasources/datasource/column'):
    dt= cal.attrib[ 'datatype' ]
    nm= cal.attrib[ 'name' ]
    print 'Column name:', dt, '    ', 'datatype:', nm

result:

Column name: Calculation_1,    datatype:real
Column name: Sales,    datatype:integer

However if I use cal.attrib['header'] Python 2.7. prints

"KeyError: 'header'

Question: How to tell Python 2.7. to produce the desired output:

Calculation "Unit Price": Sum(Profit)/Sum(Sales)

More precisely what Python should do: " for all (= if there are more than only one like in the above example) columns that contain the attribute 'header' print the output

header: Unit Price
    formula: Sum(Profit)
header: Sales per day in month
    formula: Sales / count(days(month))

(Note: to show a more complete desired output I added another column that's not in my example yet)

Thanks a lot for any help!

user2006697
  • 1,107
  • 2
  • 11
  • 25

3 Answers3

2

You can use XPath predicate expression to filter element by certain criteria i.e filter column element that has header attribute : column[@header] *. So your for loop will look about like this :

for cal in xmlfile.findall('datasources/datasource/column[@header]'):
    print "header: " + cal.attrib["header"]
    print "    formula: " + cal.find('calculation').attrib["formula"]

*) Notice that @attribute_name syntax is used to reference XML attribute in XPath.


Instead, if you mean to iterate through all column regardless it has header attribute or not, but only print header attribute value when the column has the attribute, then you can achieve that using simple if block, something like this :

if "header" in cal.attrib:
    print "header: " + cal.attrib["header"]
har07
  • 88,338
  • 12
  • 84
  • 137
  • Hi thanks a lot. Although python now does not give an Error anymore, it unfortunately prints nothing at all anymore .... – user2006697 Mar 10 '16 at 12:45
0

Maybe you can use the "BeautifullSoup" (bs4) module instead of "xml.etree"

Take a look at Python BeautifulSoup XML Parsing and Extracting properly data with bs4?

Community
  • 1
  • 1
A. STEFANI
  • 6,707
  • 1
  • 23
  • 48
0

KeyError is Python's way of telling you that the attribute you requested was not found in that element. This is ok, chances are your findall xpath is pulling in a few elements that don't have "header" attributes. Since you are only interested in those that fall under your search AND happen to have a "header" attribute attached to them, you can do the following:

for cal in xmlfile.findall('datasources/datasource/column'):
    try:
        header = cal.attrib["header"]
        #Do something with the header
        print header
    except KeyError:
        #This is where you end up if the element doesn't have a 'header' attribute
        #You shouldn't have to do anything with this 'cal' element

Sure, you could check if the header exists first, but after using python for some time I think this method is simpler.

pholtz
  • 464
  • 3
  • 10