
I can currently remove duplicates when there is no key in front of the nested dictionary, i.e. when the list contains flat dicts. An example of a list of dicts that works with this function:

[{'asndb_prefix': '164.39.xxx.0/17',
  'cidr': '164.39.xxx.0/17',
  'cymru_asn': 'XXX',
  'cymru_country': 'GB',
  'cymru_owner': 'XXX , GB',
  'cymru_prefix': '164.39.xxx.0/17',
  'ips': ['164.39.xxx.xxx'],
  'network_id': '164.39.xxx.xxx/24'},
 {'asndb_prefix': '54.192.xxx.xxx/16',
  'cidr': '54.192.0.0/16',
  'cymru_asn': '16509',
  'cymru_country': 'US',
  'cymru_owner': 'AMAZON-02 - Amazon.com, Inc., US',
  'cymru_prefix': '54.192.144.0/22',
  'ips': ['54.192.xxx.xxx', '54.192.xxx.xxx'],
  'network_id': '54.192.xxx.xxx/24'}]

def remove_dict_duplicates(list_of_dicts):
    """Remove duplicate flat dicts from a list."""
    # Pack each dict into a hashable tuple of its items so that a set
    # can drop exact copies, then unpack back into dicts.
    return [dict(t) for t in set(tuple(d.items()) for d in list_of_dicts)]
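For example, with simple hashable values (toy data, not my real output; because a set is used, the order of the result is arbitrary):

flat = [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}, {'a': 1, 'b': 3}]
print(remove_dict_duplicates(flat))
# [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}]  (in some order)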

However, I would like to remove duplicates based on the outer key together with all the values nested under it. If the same key appears with different values inside, the entry should be kept; only a complete copy should be removed.

    [{'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16',
                         'cidr': '50.16.0.0/14',
                         'cymru_asn': 'xxxx',
                         'cymru_country': 'US',
                         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
                         'cymru_prefix': '50.16.0.0/16',
                         'ip': '50.16.221.xxx',
                         'network_id': '50.16.xxx.0/24',
                         'pyasn_asn': xxxx,
                         'whois_asn': 'xxxx'}},
     # complete copy of the first entry - this would be removed
     {'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16',
                         'cidr': '50.16.0.0/14',
                         'cymru_asn': 'xxxx',
                         'cymru_country': 'US',
                         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
                         'cymru_prefix': '50.16.0.0/16',
                         'ip': '50.16.221.xxx',
                         'network_id': '50.16.xxx.0/24',
                         'pyasn_asn': xxxx,
                         'whois_asn': 'xxxx'}},
     # same key but different inner values - this would NOT be removed
     {'50.16.xxx.0/24': {'asndb_prefix': '50.999.0.0/16',
                         'cidr': '50.999.0.0/14',
                         'cymru_asn': 'xxxx',
                         'cymru_country': 'US',
                         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
                         'cymru_prefix': '50.16.0.0/16',
                         'ip': '50.16.221.xxx',
                         'network_id': '50.16.xxx.0/24',
                         'pyasn_asn': xxxx,
                         'whois_asn': 'xxxx'}}]

How do I go about doing this? Thank you.


2 Answers


To remove duplicates from a list of dicts:

list_of_unique_dicts = []
for dict_ in list_of_dicts:
    if dict_ not in list_of_unique_dicts:
        list_of_unique_dicts.append(dict_)
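
Since `not in` compares dicts by equality, this handles the nested case as well and preserves the original order. A minimal sketch with made-up data:

list_of_dicts = [{'a': {'x': 1}}, {'a': {'x': 1}}, {'a': {'x': 2}}]

list_of_unique_dicts = []
for dict_ in list_of_dicts:
    if dict_ not in list_of_unique_dicts:
        list_of_unique_dicts.append(dict_)

print(list_of_unique_dicts)  # [{'a': {'x': 1}}, {'a': {'x': 2}}]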
wim

If the order of the result is not important, you can use a set to remove the duplicates by converting each inner dict into a frozenset of its items:

def remove_dict_duplicates(list_of_dicts):
    """Remove duplicates from a list of nested dicts."""
    # Pack each {key: inner_dict} entry into a hashable
    # (key, frozenset-of-items) pair so a set can drop complete copies.
    packed = {(k, frozenset(v.items())) for elem in list_of_dicts
              for k, v in elem.items()}
    # Unpack back into the original {key: inner_dict} shape.
    return [{k: dict(v)} for k, v in packed]
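
Applied to a toy version of the question's nested data, complete copies collapse while entries with differing inner values survive:

data = [{'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16'}},
        {'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16'}},
        {'50.16.xxx.0/24': {'asndb_prefix': '50.999.0.0/16'}}]
print(remove_dict_duplicates(data))
# two dicts remain; their order is not guaranteed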

This assumes that all values of the innermost dicts are hashable.
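
For instance, a list value, like the 'ips' field in the question's first example, is not hashable, so packing such a dict would fail:

d = {'ips': ['54.192.xxx.xxx']}
frozenset(d.items())  # TypeError: unhashable type: 'list'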

Giving up the order yields potential speedups for large lists. For example, creating a list with 100,000 elements:

inner = {'asndb_prefix': '50.999.0.0/16',
         'cidr': '50.999.0.0/14',
         'cymru_asn': '14618',
         'cymru_country': 'US',
         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
         'cymru_prefix': '50.16.0.0/16',
         'ip': '50.16.221.xxx',
         'network_id': '50.16.xxx.0/24',
         'pyasn_asn': 14618,
         'whois_asn': '14618'}

large_list = list_of_dicts + [{x: inner} for x in range(int(1e5))]

It takes quite a while checking for duplicates in the result list again and again:

def remove_dupes(list_of_dicts):
    """Source: answer from wim"""
    list_of_unique_dicts = []
    for dict_ in list_of_dicts:
        if dict_ not in list_of_unique_dicts:
            list_of_unique_dicts.append(dict_)
    return list_of_unique_dicts

%timeit remove_dupes(large_list)
1 loop, best of 3: 2min 55s per loop

My approach, using a set, is much faster:

%timeit remove_dict_duplicates(large_list)
1 loop, best of 3: 590 ms per loop
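
If both speed and the original order matter, the two ideas can be combined: keep a set of the packed keys as a "seen" record while appending first occurrences to a list. A sketch (the function name is mine, not from either answer):

def remove_dupes_ordered(list_of_dicts):
    """Order-preserving dedup via a seen-set of hashable keys (sketch)."""
    seen = set()
    result = []
    for elem in list_of_dicts:
        # Same packing as above: a hashable signature of the nested dict.
        key = frozenset((k, frozenset(v.items())) for k, v in elem.items())
        if key not in seen:
            seen.add(key)
            result.append(elem)
    return result

Set membership checks are O(1) on average, so this stays fast on large lists while keeping the first occurrence of each entry in place.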
Mike Müller