I have a list of dictionaries and I want to group the dictionaries that have the most key-value pairs in common. For example, the list of dicts is:

data_dict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 1, 'b': 2, 'e': 1}, {'a': 1, 'b': 2, 'c': 3, 'z': 5}, {'a': 1, 'b': 2, 'j': 1}]

The length of data_dict can be up to 4000.

output = [(0,2),(1,3)]

Here the elements of the output represent the indices of the dictionaries in data_dict.

The dictionaries at indices 0 and 2 share the most key-value pairs, so they are grouped together. Similarly, the dictionaries at indices 1 and 3 share the most key-value pairs among the remaining dictionaries.

priyansh jain

1 Answer

First, convert the dictionaries to sets of key-value pairs (their items):

>>> data_dict = [{'a':1 , 'b':2,'c':3 , 'd':4} , {'a':1,'b':2 ,'e':1} , {'a':1 ,'b':2,'c':3 ,'z':5} , {'a':1 , 'b':2 , 'j':1}]
>>> L = [set(d.items()) for d in data_dict]
>>> L
[{('c', 3), ('a', 1), ('b', 2), ('d', 4)}, {('a', 1), ('e', 1), ('b', 2)}, {('c', 3), ('a', 1), ('b', 2), ('z', 5)}, {('j', 1), ('a', 1), ('b', 2)}]

Now, you want to group the sets into pairs having the most items in common. You can use a naive algorithm in O(N^3): check every set against all the others to find the first pair, then repeat until all pairs are found:

N = len(L)
left = set(range(N))
while left:
    c = -1
    pair = None
    for i in left:
        for j in left & set(range(i + 1, N)):
            t = len(L[i] & L[j])
            if t > c:
                pair = i, j
                c = t

    if pair is None:  # odd number of dicts: one index remains unpaired
        break
    left.difference_update(pair)  # remove the indices of the current pair
    print(pair)

    # (0, 2)
    # (1, 3)

But that's slow for 4000 dictionaries (4000^3 = 64 billion operations).
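For reference, the same greedy search can be written more compactly with itertools and max (a sketch with the same O(N^3) behaviour, just shorter; the variable names are my own):

```python
import itertools

data_dict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 1, 'b': 2, 'e': 1},
             {'a': 1, 'b': 2, 'c': 3, 'z': 5}, {'a': 1, 'b': 2, 'j': 1}]
L = [set(d.items()) for d in data_dict]

left = set(range(len(L)))
result = []
while len(left) > 1:
    # best pair among the remaining indices, by size of the items intersection
    pair = max(itertools.combinations(sorted(left), 2),
               key=lambda p: len(L[p[0]] & L[p[1]]))
    result.append(pair)
    left.difference_update(pair)

print(result)  # [(0, 2), (1, 3)]
```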

Another idea is to build a dict mapping each item to the list of all sets containing it. Then transform this dict into a dict mapping each item to the pairs of sets containing it (combinations of two elements of the previous list). Finally, count all the pairs created and choose the most frequent (see https://stackoverflow.com/a/26028865/6914441 for the idea).

This might be slower in the worst case, since you have to create all the combinations, but it could be faster in practice if each key-value pair is shared by only a limited number of dicts.

indices_by_item = {}
for i, items in enumerate(L):
    for item in items:
        indices_by_item.setdefault(item, []).append(i)

# {('a', 1): [0, 1, 2, 3], ('d', 4): [0], ('c', 3): [0, 2], ('b', 2): [0, 1, 2, 3], ('e', 1): [1], ('z', 5): [2], ('j', 1): [3]}

# compute the combination
import itertools
pairs_by_item = {item: list(itertools.combinations(indices, 2)) for item, indices in indices_by_item.items()}

# {('a', 1): [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], ('d', 4): [], ('c', 3): [(0, 2)], ('b', 2): [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], ('e', 1): [], ('z', 5): [], ('j', 1): []}

import collections
c = collections.Counter([pair for pairs in pairs_by_item.values() for pair in pairs])

# Counter({(0, 2): 3, (0, 1): 2, (0, 3): 2, (1, 2): 2, (1, 3): 2, (2, 3): 2})

while c:
    pair, _ = c.most_common(1)[0]
    print(pair)
    # remove all the pairs having one element of the best pair
    for other_pair in list(c):
        if pair[0] in other_pair or pair[1] in other_pair:
            del c[other_pair]

# (0, 2)
# (1, 3)
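Putting the second approach together as one self-contained function (a sketch; `group_most_common` is a name chosen here for illustration, not part of the original answer):

```python
import collections
import itertools

def group_most_common(dicts):
    """Greedily pair dict indices by the number of shared key-value pairs."""
    sets = [set(d.items()) for d in dicts]

    # item -> indices of the dicts containing that key-value pair
    indices_by_item = {}
    for i, items in enumerate(sets):
        for item in items:
            indices_by_item.setdefault(item, []).append(i)

    # count each pair of indices once per item they share
    c = collections.Counter(
        pair
        for indices in indices_by_item.values()
        for pair in itertools.combinations(indices, 2)
    )

    result = []
    while c:
        pair, _ = c.most_common(1)[0]
        result.append(pair)
        # drop every pair that reuses an index of the chosen pair
        for other in list(c):
            if pair[0] in other or pair[1] in other:
                del c[other]
    return result

data_dict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4}, {'a': 1, 'b': 2, 'e': 1},
             {'a': 1, 'b': 2, 'c': 3, 'z': 5}, {'a': 1, 'b': 2, 'j': 1}]
print(group_most_common(data_dict))  # [(0, 2), (1, 3)]
```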
jferard