I'm extracting instances of three elements from an XML file: ComponentStr, keyID, and valueStr. Whenever I find a ComponentStr, I want to associate the keyID:valueStr pair with it. ComponentStr values are not unique. As multiple occurrences of a ComponentStr are read, I want to accumulate the keyID:valueStr pairs for that ComponentStr group. The resulting accumulated data structure after reading the XML file might look like this:

ComponentA: key1:value1, key2:value2, key3:value3

ComponentB: key4:value4

ComponentC: key5:value5, key6:value6

After I generate the final data structure, I want to sort the keyID:valueStr entries within each ComponentStr and also sort all the ComponentStrs.

I'm trying to structure this data in Python 2. The ComponentStr values seem to work well as a set. The keyID:valueStr pairs are clearly a dict. But how do I associate a ComponentStr entry in a set with its dict entries?

Alternatively, is there a better way to organize this data besides a set and associated dict entries? Each keyID is unique. Perhaps I could have one dict mapping keyID to some combination of ComponentStr and valueStr? After the data structure is built, I could sort it by ComponentStr first, then perform some type of slice to group the keyID:valueStr pairs, and then sort again on the keyID? That seems complicated.

Russia Must Remove Putin
  • Steven, welcome to Stack Overflow. Nice question, by the way. Remember to accept the answer that works best for you by clicking the checkmark next to it; it will give you +2 to your rep. – Russia Must Remove Putin Jun 16 '14 at 04:47

2 Answers

2

How about a dict of dicts?

data = {
'ComponentA': {'key1':'value1', 'key2':'value2', 'key3':'value3'},
'ComponentB': {'key4':'value4'},
'ComponentC': {'key5':'value5', 'key6':'value6'},
}

It maintains your data structure and mapping. Interestingly enough, the underlying implementation of dicts is similar to the implementation of sets.
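
Because the outer dict's keys are unique and hashable, they already play the role of the set from the question, so a separate set is probably not needed. A small sketch using the sample data from the question (the variable names `has_a`, `has_z`, and `components` are only illustrative):

```python
data = {
    'ComponentA': {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'},
    'ComponentB': {'key4': 'value4'},
    'ComponentC': {'key5': 'value5', 'key6': 'value6'},
}

# Membership tests on a dict check its keys, just as they would on a set.
has_a = 'ComponentA' in data
has_z = 'ComponentZ' in data

# If an actual set of component names is ever needed, build it from the keys.
components = set(data)
```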

This could easily be constructed à la this pseudo-code:

data = {}
for file in files:
    data[get_component(file)] = {}
    for key, value in get_data(file):
        data[get_component(file)][key] = value

In the case where you have repeated components, you need the sub-dict as the default, but you add to the existing one if it's already there. I prefer setdefault to other solutions like a defaultdict or subclassing dict with a __missing__ method, as long as I only have to do it once or twice in my code:

data = {}
for file in files:
    for key, value in get_data(file):
        data.setdefault(get_component(file), {})[key] = value

It works like this:

>>> d = {}
>>> d.setdefault('foo', {})['bar'] = 'baz'
>>> d
{'foo': {'bar': 'baz'}}
>>> d.setdefault('foo', {})['ni'] = 'ichi'
>>> d
{'foo': {'ni': 'ichi', 'bar': 'baz'}}
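
For comparison, the defaultdict alternative mentioned above might look roughly like this. It is only a sketch: the `records` list is made-up sample data standing in for whatever the XML parsing yields.

```python
from collections import defaultdict

# Hypothetical (component, key, value) records, standing in for the parsed XML.
records = [
    ('ComponentA', 'key1', 'value1'),
    ('ComponentB', 'key4', 'value4'),
    ('ComponentA', 'key2', 'value2'),
]

# A missing component gets an empty dict automatically on first access.
data = defaultdict(dict)
for component, key, value in records:
    data[component][key] = value
```

One caveat with this approach: merely reading `data['SomeMissingComponent']` also creates an empty entry, whereas a plain dict with setdefault only creates entries where you explicitly call it.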

Alternatively, since your comment on the other answer says you need simple code, you can keep it really simple with more verbose and less optimized code:

data = {}
for file in files:
    for key, value in get_data(file):
        if get_component(file) not in data:
            data[get_component(file)] = {}
        data[get_component(file)][key] = value

You can then sort when you're done collecting the data.

for component in sorted(data):
    print(component)
    print('-----')
    for key in sorted(data[component]):
        # A single formatted string prints the same way in Python 2 and 3.
        print('{} {}'.format(key, data[component][key]))
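
If you need the sorted result as data rather than as printed output, one option is to build a list of sorted pairs. This is a minimal sketch with made-up sample data, and `sorted_view` is just an illustrative name:

```python
data = {
    'ComponentB': {'key4': 'value4'},
    'ComponentA': {'key2': 'value2', 'key1': 'value1'},
}

# Sort the components, and within each component sort the (key, value) pairs.
sorted_view = [
    (component, sorted(data[component].items()))
    for component in sorted(data)
]
```

Keep in mind that strings sort lexicographically, so 'key10' would come before 'key2' unless you supply your own key function.
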
Russia Must Remove Putin
  • Thanks for the help! After posting the question, I took a walk to clear my head, and during that I also came up with the dict of dicts approach. But before adding a ComponentX to the dict, won't I have to check whether it already exists in the dict? Or can I simply add the {ComponentX:{keyID:valueStr}} entry to the dict and Python will handle it appropriately? ("Appropriately" in this case is: If ComponentX not in dict, add it. Then add {ComponentX:{keyID:valueStr}}.) – Steven Calwas Jun 16 '14 at 05:25
  • No, in the case where you'll have more than one, you'll need to use something like setdefault, a defaultdict, or subclass a dict with `__missing__` (see http://stackoverflow.com/questions/635483/what-is-the-best-way-to-implement-nested-dictionaries-in-python/19829714#19829714) . I'll explain in the answer with setdefault. – Russia Must Remove Putin Jun 16 '14 at 05:28
1

I want to accumulate the keyID:valueStr for that ComponentStr group

In that case, make ComponentStr the key of your dictionary. "Accumulating" immediately suggests a list to me, and lists are easy to sort.

Each keyID is unique. Perhaps I could have one dict of keyID:some combo of ComponentStr and valueStr?

You should store your data in a manner that is the most efficient when you want to retrieve it. Since you will be accessing your data by the component, even though your keys are unique there is no point in having a dictionary that is accessed by your key (since this is not how you are going to "retrieve" the data).

So, with that - how about using a defaultdict with a list, since you really want all items associated with the same component:

from collections import defaultdict

d = defaultdict(list)

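with open('somefile.xml', 'r') as f:
    # parse_xml is a placeholder for your own parsing code; it should
    # yield (component, key, value) tuples.
    for component, key, value in parse_xml(f):
        d[component].append((key, value))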

Now each component maps to a list of (key, value) tuples.
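
The answer leaves parse_xml undefined. As a rough sketch only, here is one way it could look with the standard library's xml.etree.ElementTree. The XML layout is an assumption (each hypothetical `<entry>` element holding one ComponentStr, keyID, and valueStr child), since the question does not show the real file, and an in-memory sample replaces the file on disk:

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed layout: the <root> and <entry> element names are hypothetical.
SAMPLE = b"""<root>
  <entry><ComponentStr>ComponentA</ComponentStr><keyID>key1</keyID><valueStr>value1</valueStr></entry>
  <entry><ComponentStr>ComponentB</ComponentStr><keyID>key4</keyID><valueStr>value4</valueStr></entry>
  <entry><ComponentStr>ComponentA</ComponentStr><keyID>key2</keyID><valueStr>value2</valueStr></entry>
</root>"""


def parse_xml(f):
    """Yield (component, key, value) tuples from a file-like object."""
    for entry in ET.parse(f).getroot().iter('entry'):
        yield (entry.findtext('ComponentStr'),
               entry.findtext('keyID'),
               entry.findtext('valueStr'))


d = defaultdict(list)
for component, key, value in parse_xml(io.BytesIO(SAMPLE)):
    d[component].append((key, value))
```

With a different XML layout, only the body of parse_xml would need to change; the accumulation loop stays the same.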

If you want to keep the components in the order in which they are read from the file, you can use an OrderedDict (also from the collections module); if you are going to sort them yourself anyway, a normal dictionary is enough.
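
A minimal sketch of the OrderedDict variant, assuming made-up `records` in file order. OrderedDict has no default factory, so setdefault supplies the empty list here:

```python
from collections import OrderedDict

# Hypothetical records, in the order they would appear in the file.
records = [
    ('ComponentC', 'key5', 'value5'),
    ('ComponentA', 'key1', 'value1'),
    ('ComponentC', 'key6', 'value6'),
]

# setdefault creates the list the first time a component is seen.
d = OrderedDict()
for component, key, value in records:
    d.setdefault(component, []).append((key, value))

# Iterating yields components in first-seen order, not sorted order.
first_seen_order = list(d)
```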

To get a list of sorted component names, just sort the keys of the dictionary:

component_sorted = sorted(d.keys())

For a use case of printing the sorted components with their associated key/value pairs, sorted by their keys:

for component in component_sorted:
    pairs = d[component]
    sorted_pairs = sorted(pairs, key=lambda pair: pair[0])  # Sort by keyID
    print('Pairs for {}'.format(component))
    for k, v in sorted_pairs:
        print('{} {}'.format(k, v))
Burhan Khalid