How to remove duplicate values from defaultdict(int) in python?

Question

I have a dictionary which consists of unique key value pairs as follows:

    edu_bg={1: 1, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2}

I want to create a dictionary from above data so that similar records (which has similar values) are grouped together as follows:

   {(1, 3): 1, (5, 6): 1, (1, 4): 1, (2, 6): 1,(2, 5): 1, (3, 4): 1})

I tried to achieve above output using following code:

    myedu = defaultdict(int)
    for k,v in edu_bg.iteritems():
        for K,V in edu_bg.iteritems():
          if K == k and V == v:
              pass
          if K != k and V == v:
              myedu[(k,K)] += 1
          else:
              pass

However it has resulted in duplicate records as follows:

         defaultdict(<type 'int'>, {(1, 3): 1, (5, 6): 1, (4, 1): 1, (3, 1): 1, (5, 2): 1, (1, 4): 1, (2, 6): 1, (4, 3): 1, (6, 2): 1, (2, 5): 1, (3, 4): 1, (6, 5): 1})

I want to remove these duplicate values. Any advice on this problem is appreciated.

I'm not sure I understand your requirements. Why are (1,3) and (1,4) different? Why not (1,3,4)? This sounds like an XY problem... — juanpa.arrivillaga, Nov 13 '16 at 01:58
Take a look at http://stackoverflow.com/q/773/78845 . It may help you group your elements without writing lots of code. — johnsyweb, Nov 13 '16 at 02:34

Dan D. · Answer 1 · 2016-11-13T03:06:43.133

Inverting the mapping and taking the combinations of the groups of keys partitioned by their values.

>>> edu_bg={1: 1, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2}
>>> def invert(d):
...     i={}
...     for k,v in d.iteritems():
...             i.setdefault(v,[]).append(k)
...     return i
... 
>>> invert(edu_bg)
{1: [1, 3, 4], 2: [2, 5, 6]}

Then for each of the sublists you compute combinations(sublist, 2):

>>> [comb for sublist in {1: [1, 3, 4], 2: [2, 5, 6]}.values() for comb in combinations(sublist, 2)]
[(1, 3), (1, 4), (3, 4), (2, 5), (2, 6), (5, 6)]

All the counts will always be 1 as each combination is generated only once. Because of that we can generate the requested output simply:

>>> dict.fromkeys([(1, 3), (1, 4), (3, 4), (2, 5), (2, 6), (5, 6)], 1)
{(2, 6): 1, (5, 6): 1, (1, 4): 1, (1, 3): 1, (2, 5): 1, (3, 4): 1}

Each step combined

>>> dict.fromkeys([comb for sublist in invert(edu_bg).values() for comb in combinations(sublist, 2)],1)
{(2, 6): 1, (5, 6): 1, (1, 4): 1, (1, 3): 1, (2, 5): 1, (3, 4): 1}

This costs a lot less than either iterating over the product or the combinations and filtering. This also generates all the output without duplicates.

score 1 · Accepted Answer · edited Mar 20 '17 at 10:29

Instead of iterating over the cartesian product of every pair, which will iterated over exactly n^2 elements, you could just iterate over every possible combination, which will iterate over n(n-1)/2 elements. While the Big Oh complexity will be the same, the constant factors will be reduced significantly:

>>> from collections import defaultdict
>>> from itertools import combinations
>>> myedu = defaultdict(int)
>>> edu_bg={1: 1, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2}
>>> for k1,k2 in combinations(edu_bg,2):
...   if edu_bg[k1] == edu_bg[k2]:
...     myedu[(k1,k2)] += 1
... 
>>> myedu
defaultdict(<class 'int'>, {(2, 6): 1, (1, 3): 1, (5, 6): 1, (2, 5): 1, (3, 4): 1, (1, 4): 1})
>>>

I should reiterate, though, this sounds like the XY problem...

Mark Hannel · Answer 3 · 2016-11-13T02:26:46.380

You are comparing each key against all the other keys, including themselves. These leads to n**2 checks including two cases you don't want to consider: keys compared to themselves and keys compared twice. You've correctly passed on cases where the keys are equal. You can pass on cases where a key k is smaller than the key K and this will dismiss the half of the checks that would have been duplicates.

from collections import defaultdict

edu_bg = {1:1, 2:2, 3:1, 4:1, 5:2, 6:2}
myedu = defaultdict(int)
for k,v in edu_bg.iteritems():
    for K,V in edu_bg.iteritems():
        if K == k and V == v:
            pass
        if K > k and V == v:
            myedu[(k,K)] += 1
        else:
            pass
print myedu

Ideally you would limit the for loops such that the cases of keys being equally or twice counted would never happen. You can do that in the following way:

edu_bg = {1:1, 2:2, 3:1, 4:1, 5:2, 6:2}
myedu = defaultdict(int)
keys = edu_bg.keys()
for i in xrange(len(keys)):
    for j in xrange(i+1, len(keys)):
        k = keys[i]
        K = keys[j]
        if edu_bg[k] == edu_bg[K]:
            myedu[(k,K)] += 1         

print myedu

How to remove duplicate values from defaultdict(int) in python?

3 Answers3