0

I'm writing a Python script that first parses n XML files and builds a dict of dicts, where each nested dict holds the key-value attributes of one XML file. Now I want to group these nested dicts to find out which XML files are identical and belong in the same group. I'm looking for a Pythonic way to group identical dicts, given that every dict has the same keys.

  • I tried walking each dict and building a string from its values, then storing that string in a dict where key = string and value = list of xmlNames. When I reach the next dict and build its string, if the string already exists as a key, I simply append the XML name to that key's list.
  • I think there could be a better method based on groupby() or something else.
list_of_xmls =  ["a.xml", "b.xml", "c.xml", "d.xml"]
dictXml = dict()
for xml in list_of_xmls:
    dictXml[xml] = parseXml(xml)   # Returns dict by parsing xml (key-value)

# parseXml(xml) parses one XML file and returns a dict such as:
#   a.xml -> {"config":"4", "location":"C:\\xyz", "Group":"amcat"}
#   b.xml -> {"config":"4", "location":"C:\\xyz", "Group":"amcat"}
#   c.xml -> {"config":"5", "location":"C:\\mno", "Group":"alien"}
#   d.xml -> {"config":"5", "location":"C:\\mno", "Group":"alien"}

# Suppose a.xml and b.xml have the same values for all keys,
# and likewise c.xml and d.xml.
# Then I should get two groups: (a.xml, b.xml) and (c.xml, d.xml)
########### Some processing on the above dict ###########

finalOutput = [["a.xml", "b.xml"], ["c.xml", "d.xml"]]


The output should be a list of the groups that can be processed together (basically a list of lists).

Also, dictXml could be any other data structure, such as a list of dicts. Any thoughts?

Basically, the whole idea is: given a list of XML files, I need to figure out which ones are identical based on the key-values inside them, put the identical ones into a list, and then process each group.
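The string-key approach from the first bullet could be sketched roughly as follows. This is a minimal sketch, not the original script: the post does not show how the string is built, so using `json.dumps` with `sort_keys=True` is an assumption, and the literal `dictXml` below is a stand-in for the results of `parseXml()`.

```python
import json

# Stand-in for the parsed results; parseXml() itself is not shown in the post.
dictXml = {
    "a.xml": {"config": "4", "location": "C:\\xyz", "Group": "amcat"},
    "b.xml": {"config": "4", "location": "C:\\xyz", "Group": "amcat"},
    "c.xml": {"config": "5", "location": "C:\\mno", "Group": "alien"},
    "d.xml": {"config": "5", "location": "C:\\mno", "Group": "alien"},
}

groups = {}  # key string -> list of xml names
for name in sorted(dictXml):
    # Assumption: serialize the whole dict; sort_keys=True makes the
    # string independent of the order in which keys were inserted.
    key = json.dumps(dictXml[name], sort_keys=True)
    groups.setdefault(key, []).append(name)

finalOutput = list(groups.values())
print(finalOutput)
```

Each file is visited once and looked up by its string, so no sorting of the attribute dicts against each other is involved.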

PeXXeR

3 Answers

1

You could use itertools.groupby (doc) to do the grouping:

list_of_xmls =  ["a.xml", "b.xml", "c.xml", "d.xml"]

dictXml = {
'a.xml': {"config":"4", "location":"C:\\xyz", "Group":"amcat"},
'c.xml': {"config":"5", "location":"C:\\mno", "Group":"alien"},
'b.xml': {"config":"4", "location":"C:\\xyz", "Group":"amcat"},
'd.xml': {"config":"5", "location":"C:\\mno", "Group":"alien"},
}

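from itertools import groupby
from operator import itemgetter

out = []
f = itemgetter(1)  # sort and group on the list of items, not on the file name
# Pair each file name with the list of its (key, value) items.
# This relies on every dict listing its keys in the same order.
s = sorted([(k, list(v.items())) for k, v in dictXml.items()], key=f)
for _, g in groupby(s, f):
    out.append([i[0] for i in g])  # keep only the file names of each group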

print(out)

Prints:

[['a.xml', 'b.xml'], ['c.xml', 'd.xml']]
Andrej Kesely
  • Thanks for answering. Can you please answer it for Python 2.7 or below? It says: Python version <3.5 doesn't support starred expression in tuple, list and sets. – PeXXeR Jul 26 '19 at 20:02
  • @PeXXeR Updated my answer, tested it with Python 2.7.15 – Andrej Kesely Jul 26 '19 at 20:14
  • Thanks a lot for helping so quickly. May I ask one more thing? I could barely understand what is going on here. Can you please explain it briefly? I'm asking so that I can work out whether there is a better way in terms of time complexity, or whether yours is already the best. – PeXXeR Jul 26 '19 at 20:20
  • @PeXXeR The complexity is O(n log n) because of the `sorted()` function. First we sort the entries of `dictXml` by the list of items we build from each value. Then we apply `itertools.groupby` to this sorted list and extract the keys from each group. – Andrej Kesely Jul 26 '19 at 20:27
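To illustrate why the sort in that explanation matters: `itertools.groupby` only merges equal keys that sit next to each other, so unsorted input can split one logical group into several. The keys below are hypothetical and chosen only to show the effect.

```python
from itertools import groupby

# Hypothetical keys, chosen only to show the effect of sorting
keys = ["amcat", "alien", "amcat"]

# Without sorting, equal keys that are not adjacent end up in separate groups
unsorted_groups = [(k, len(list(g))) for k, g in groupby(keys)]

# After sorting, equal keys are adjacent and each forms a single group
sorted_groups = [(k, len(list(g))) for k, g in groupby(sorted(keys))]

print(unsorted_groups)  # [('amcat', 1), ('alien', 1), ('amcat', 1)]
print(sorted_groups)    # [('alien', 1), ('amcat', 2)]
```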
0

Try this: first I collect only the XML name and the group it belongs to into a list of tuples, then I apply the grouping algorithm from "Group list by values".

dictXml = {"a.xml":{"Group":"a"}, "b.xml":{"Group":"b"}, "c.xml":{"Group":"b"}, "d.xml":{"Group":"d"}}

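# Pair each xml name with the value of its "Group" key
xml_group_list = [(xml, xml_dic["Group"]) for xml, xml_dic in dictXml.items()]
# Distinct group values (a set, so the order of the groups is not guaranteed)
values = set(map(lambda x: x[1], xml_group_list))
# For each group value, collect the xml names that carry it
newlist = [[y[0] for y in xml_group_list if y[1] == x] for x in values]
print(newlist)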

output:

[['a.xml'], ['b.xml', 'c.xml'], ['d.xml']]
kkawabat
0

Here is another way to solve your problem. Since I don't know what parseXml() does, I used a predefined_dict instead; you can replace predefined_dict[xml] with parseXml(xml).

list_of_xmls =  ["a.xml", "b.xml", "c.xml", "d.xml"]
predefined_dict = {"a.xml":{"name":"mice", "surename":"dine"},
                     "b.xml":{"name":"akks", "surename":"john"}, 
                     "c.xml":{"name":"mice", "surename":"dine"},
                     "d.xml":{"name":"akks", "surename":"john"}}
dictXml = dict()
finalOutput = []
for xml in list_of_xmls:
    # The tuple of attribute values is hashable, so it can serve as the grouping key
    temp = tuple(predefined_dict[xml].values())
    print(temp)
    try:
        dictXml[temp].append(xml)
    except KeyError:
        dictXml[temp] = [xml]

print(dictXml)
# Each dict value is one group of xml names
for value in dictXml.values():
    finalOutput.append(value)
print("finalOutput", finalOutput)
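One caveat of keying on the values alone is that two dicts with different keys but the same values would land in the same group, and the key order affects the tuple. A possible variant, sketched here under the assumption that all attribute values are hashable (strings, as in the question), keys on a `frozenset` of the (key, value) pairs instead. The function name `group_identical` and the sample data are hypothetical.

```python
from collections import defaultdict

def group_identical(dict_of_dicts):
    """Group names whose attribute dicts are equal (same keys and same values)."""
    groups = defaultdict(list)
    for name, attrs in dict_of_dicts.items():
        # frozenset of (key, value) pairs: hashable and independent of key order.
        # Assumes every value is hashable (e.g. a string).
        groups[frozenset(attrs.items())].append(name)
    return [sorted(names) for names in groups.values()]

# Hypothetical data: b.xml has the same values as a.xml but under a different key.
sample = {
    "a.xml": {"name": "mice", "surename": "dine"},
    "b.xml": {"name": "mice", "nickname": "dine"},
    "c.xml": {"surename": "dine", "name": "mice"},  # same as a.xml, keys in another order
}
result = group_identical(sample)
print(result)
```

With this key, a.xml and c.xml fall into one group and b.xml stays on its own, and the whole pass is a single loop over the files.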
AkshayB