
I've tried searching various questions and answers here on StackOverflow and cannot find a solution that works for my situation, so here is my issue.

I have 3 XML files that I am attempting to compare. The issue I am having is grabbing one section of the "Main" XML file at a time and keeping its information together. For example, I want to keep the information associated with id 1 together and be able to use each piece within the script.

This XML file can have any number of fields between the <row> tags, but I only need 5 specific fields. I am fairly new to Python and extremely new to using Python to read anything more than a text file, so any help would be appreciated.

A sample of the xml is below.

Main XML:
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <resultset table="foo_bar">
    <row>
        <field name="id">1</field>
        <field name="name">foo 1</field>
        <field name="item 1">bar 1</field>
        <field name="item 2">Accepted</field>
        <field name="item 3">Accepted</field>
    </row>
    <row>
        <field name="id">2</field>
        <field name="name">foo 2</field>
        <field name="item 1">bar 2</field>
        <field name="item 2">Declined</field>
        <field name="item 3">Accepted</field>
    </row>
    <row>
        <field name="id">3</field>
        <field name="name">foo 3</field>
        <field name="item 1">bar 3</field>
        <field name="item 2">Accepted</field>
        <field name="item 3">Declined</field>
    </row>
    .....Continues
    </resultset>

I have tried following the various answers for similar questions, but have had no success thus far.

EDIT: I've tried multiple things; I'll have to dig through the various .py scripts to find all of them. Here is the most recent attempt, based on the question posted here:

from lxml import etree as ET

def filter_by_itemid(doc, idlist):
    rowset = doc.xpath("//row")
    for elem in rowset.getchildren():
        if elem.get("*") not in idlist:
            rowset.remove(elem)
    return doc

doc = ET.parse("my.xml")
filter_by_itemid(doc, ['id', 'name', 'item 1', 'item 2', 'item 3'])

print(ET.tostring(doc))

I know I am doing something wrong somewhere, and the formatting of the xml (which I am unable to change at the source) isn't helping...

The error I receive is "AttributeError: 'list' object has no attribute 'getchildren'".

Mike S.

1 Answer


How about something like this:

from lxml import etree

root = etree.parse('xml.xml')
rows = root.findall('row')

all_data = []

for row in rows:
    field_dict = {}
    fields = row.findall('field')

    for field in fields:
        field_dict[field.get('name')] = field.text

    print(field_dict)

    all_data.append(field_dict)

print(all_data)


--output:--
{'item 3': 'Accepted', 'item 2': 'Accepted', 'item 1': 'bar 1', 'id': '1', 'name': 'foo 1'}
{'item 3': 'Accepted', 'item 2': 'Declined', 'item 1': 'bar 2', 'id': '2', 'name': 'foo 2'}
{'item 3': 'Declined', 'item 2': 'Accepted', 'item 1': 'bar 3', 'id': '3', 'name': 'foo 3'}


[{'item 3': 'Accepted', 'item 2': 'Accepted', 'item 1': 'bar 1', 'id': '1', 'name': 'foo 1'}, {'item 3': 'Accepted', 'item 2': 'Declined', 'item 1': 'bar 2', 'id': '2', 'name': 'foo 2'}, {'item 3': 'Declined', 'item 2': 'Accepted', 'item 1': 'bar 3', 'id': '3', 'name': 'foo 3'}]

The extra fields that may be in a row will be in the field_dict, but you can just ignore them. Or, if that doesn't work for you, you can filter out the garbage:

from lxml import etree

root = etree.parse('xml.xml')
rows = root.findall('row')

#Create a set:
allowed_names = {
    'id',
    'name',
    'item 1',
    'item 2',
    'item 3'
}

all_data = []


for row in rows:
    field_dict = {}
    fields = row.findall('field')

    for field in fields:
        name_val = field.get('name')

        if name_val in allowed_names:
            field_dict[name_val] = field.text

    print(field_dict)

    all_data.append(field_dict)

print(all_data)

And if it's more convenient, you can make all_data a dictionary instead of a list: use each row's id as the key, and let the value for each key be a dictionary holding the rest of that row's data.
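Here is a rough sketch of that dictionary-keyed-by-id idea. To keep it runnable on its own, it uses the standard library's xml.etree.ElementTree and an inline sample string instead of lxml and a file; the findall(), get(), and .text calls are the same ones used above. The helper name rows_by_id and the "extra" field are made up for illustration, and skipping rows that have no id is an assumption about what you'd want.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the "Main" XML file (assumption: same layout as the sample).
SAMPLE = """
<resultset table="foo_bar">
  <row>
    <field name="id">1</field>
    <field name="name">foo 1</field>
    <field name="item 1">bar 1</field>
    <field name="item 2">Accepted</field>
    <field name="item 3">Accepted</field>
    <field name="extra">not needed</field>
  </row>
  <row>
    <field name="id">2</field>
    <field name="name">foo 2</field>
    <field name="item 1">bar 2</field>
    <field name="item 2">Declined</field>
    <field name="item 3">Accepted</field>
    <field name="extra">not needed</field>
  </row>
</resultset>
"""

ALLOWED_NAMES = {"id", "name", "item 1", "item 2", "item 3"}


def rows_by_id(root, allowed=ALLOWED_NAMES):
    """Map each row's id to a dict of its other allowed fields (hypothetical helper)."""
    data = {}
    for row in root.findall("row"):
        # Keep only the fields whose name attribute is in the allowed set.
        fields = {
            field.get("name"): field.text
            for field in row.findall("field")
            if field.get("name") in allowed
        }
        # Pull the id out to use as the key; rows without an id are skipped.
        row_id = fields.pop("id", None)
        if row_id is not None:
            data[row_id] = fields
    return data


root = ET.fromstring(SAMPLE)
all_data = rows_by_id(root)
print(all_data["2"]["item 2"])
```

With this layout you can look up a single row directly, e.g. all_data["2"], instead of looping over a list to find it. Note that the ids are strings, because that is what .text returns.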

7stud
  • Thanks for the prompt reply, I thought it was a "storage" issue based on the error. Wasn't sure how to correct the problem. I still need to filter out all the extra fields, the sample I provided only included an example of what I needed. The xml itself can include upwards of 15 extra fields between each set of tags. – Mike S. Jun 07 '13 at 18:14
  • @MikeS., I added an example with a filter. – 7stud Jun 07 '13 at 18:17
  • Thank you again, I can't believe I overlooked the use of sets to filter the information. I am voting for this answer as soon as the page reloads from posting this comment. – Mike S. Jun 07 '13 at 18:30
  • 1) You could've used a list for the allowed names--'in' works on a list too--it's just not as fast as set lookups. 2) You could've used a dictionary with the keys being the allowed names and each of their values being True then you could've written...if allowed_names.get(name_val, False): field_dict[name_val] = field.text. – 7stud Jun 07 '13 at 18:39
  • I will definitely look into the alternatives, my overall goal is to eventually compare this xml with 2 other files and create an output that lists the values from this one that are not in the other 2. This has been huge help for me at the current part of the python program. I work on programs in stages/parts to make things a little easier. – Mike S. Jun 07 '13 at 19:07
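For the comparison mentioned in the last comment, one possible sketch is below. It assumes each file has already been loaded into a dictionary keyed by id, and that "not in the other 2" means the id is absent from both of the other files; neither detail is spelled out in the question, so adjust as needed. The function name only_in_main and the sample dictionaries are invented for illustration.

```python
def only_in_main(main, *others):
    """Return the entries of main whose id appears in none of the other dicts."""
    seen_ids = set()
    for other in others:
        # Collect every id found in any of the other files.
        seen_ids.update(other.keys())
    return {
        row_id: fields
        for row_id, fields in main.items()
        if row_id not in seen_ids
    }


# Made-up data standing in for the three parsed files.
main = {"1": {"name": "foo 1"}, "2": {"name": "foo 2"}, "3": {"name": "foo 3"}}
second = {"1": {"name": "foo 1"}}
third = {"3": {"name": "foo 3"}}

unmatched = only_in_main(main, second, third)
print(unmatched)
```

Collecting the other files' ids into a set keeps each membership check fast, which is the same reason a set was used for allowed_names.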