
I can currently remove duplicates when there is no key in front of the nested dictionary, i.e. when the list contains flat dicts. An example of a list of dicts that works with this function:

[{'asndb_prefix': '164.39.xxx.0/17',
  'cidr': '164.39.xxx.0/17',
  'cymru_asn': 'XXX',
  'cymru_country': 'GB',
  'cymru_owner': 'XXX , GB',
  'cymru_prefix': '164.39.xxx.0/17',
  'ips': ['164.39.xxx.xxx'],
  'network_id': '164.39.xxx.xxx/24'},
 {'asndb_prefix': '54.192.xxx.xxx/16',
  'cidr': '54.192.0.0/16',
  'cymru_asn': '16509',
  'cymru_country': 'US',
  'cymru_owner': 'AMAZON-02 - Amazon.com, Inc., US',
  'cymru_prefix': '54.192.144.0/22',
  'ips': ['54.192.xxx.xxx', '54.192.xxx.xxx'],
  'network_id': '54.192.xxx.xxx/24'}]

def remove_dict_duplicates(list_of_dicts):
    """Remove duplicate flat dicts from a list."""
    # Pack each dict into a hashable tuple of its items so that a set
    # can drop exact copies, then unpack back into dicts.
    return [dict(t) for t in set(tuple(d.items()) for d in list_of_dicts)]
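For example, with simple hashable values (toy data, not my real output; because a set is used, the order of the result is arbitrary):

flat = [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}, {'a': 1, 'b': 3}]
print(remove_dict_duplicates(flat))
# [{'a': 1, 'b': 2}, {'a': 1, 'b': 3}]  (in some order)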

However, I would like to remove duplicates based on the outer key together with all the values nested under it. If the same key appears with different values inside, the entry should be kept; only a complete copy should be removed.

    [{'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16',
                         'cidr': '50.16.0.0/14',
                         'cymru_asn': 'xxxx',
                         'cymru_country': 'US',
                         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
                         'cymru_prefix': '50.16.0.0/16',
                         'ip': '50.16.221.xxx',
                         'network_id': '50.16.xxx.0/24',
                         'pyasn_asn': xxxx,
                         'whois_asn': 'xxxx'}},
     # complete copy of the first entry - this would be removed
     {'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16',
                         'cidr': '50.16.0.0/14',
                         'cymru_asn': 'xxxx',
                         'cymru_country': 'US',
                         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
                         'cymru_prefix': '50.16.0.0/16',
                         'ip': '50.16.221.xxx',
                         'network_id': '50.16.xxx.0/24',
                         'pyasn_asn': xxxx,
                         'whois_asn': 'xxxx'}},
     # same key but different inner values - this would NOT be removed
     {'50.16.xxx.0/24': {'asndb_prefix': '50.999.0.0/16',
                         'cidr': '50.999.0.0/14',
                         'cymru_asn': 'xxxx',
                         'cymru_country': 'US',
                         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
                         'cymru_prefix': '50.16.0.0/16',
                         'ip': '50.16.221.xxx',
                         'network_id': '50.16.xxx.0/24',
                         'pyasn_asn': xxxx,
                         'whois_asn': 'xxxx'}}]

How do I go about doing this? Thank you.


2 Answers


To remove duplicates from a list of dicts:

list_of_unique_dicts = []
for dict_ in list_of_dicts:
    if dict_ not in list_of_unique_dicts:
        list_of_unique_dicts.append(dict_)
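
Since `not in` compares dicts by equality, this handles the nested case as well and preserves the original order. A minimal sketch with made-up data:

list_of_dicts = [{'a': {'x': 1}}, {'a': {'x': 1}}, {'a': {'x': 2}}]

list_of_unique_dicts = []
for dict_ in list_of_dicts:
    if dict_ not in list_of_unique_dicts:
        list_of_unique_dicts.append(dict_)

print(list_of_unique_dicts)  # [{'a': {'x': 1}}, {'a': {'x': 2}}]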
wim

If the order of the result is not important, you can use a set to remove the duplicates by converting each inner dict into a frozenset of its items:

def remove_dict_duplicates(list_of_dicts):
    """Remove duplicates from a list of nested dicts."""
    # Pack each {key: inner_dict} entry into a hashable
    # (key, frozenset-of-items) pair so a set can drop complete copies.
    packed = {(k, frozenset(v.items())) for elem in list_of_dicts
              for k, v in elem.items()}
    # Unpack back into the original {key: inner_dict} shape.
    return [{k: dict(v)} for k, v in packed]
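
Applied to a toy version of the question's nested data, complete copies collapse while entries with differing inner values survive:

data = [{'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16'}},
        {'50.16.xxx.0/24': {'asndb_prefix': '50.16.0.0/16'}},
        {'50.16.xxx.0/24': {'asndb_prefix': '50.999.0.0/16'}}]
print(remove_dict_duplicates(data))
# two dicts remain; their order is not guaranteed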

This assumes that all values of the innermost dicts are hashable.
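
For instance, a list value, like the 'ips' field in the question's first example, is not hashable, so packing such a dict would fail:

d = {'ips': ['54.192.xxx.xxx']}
frozenset(d.items())  # TypeError: unhashable type: 'list'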

Giving up the order yields potential speedups for large lists. For example, creating a list with 100,000 elements:

inner = {'asndb_prefix': '50.999.0.0/16',
         'cidr': '50.999.0.0/14',
         'cymru_asn': '14618',
         'cymru_country': 'US',
         'cymru_owner': 'AMAZON-AES - Amazon.com, Inc., US',
         'cymru_prefix': '50.16.0.0/16',
         'ip': '50.16.221.xxx',
         'network_id': '50.16.xxx.0/24',
         'pyasn_asn': 14618,
         'whois_asn': '14618'}

large_list = list_of_dicts + [{x: inner} for x in range(int(1e5))]

It takes quite a while checking for duplicates in the result list again and again:

def remove_dupes(list_of_dicts):
    """Source: answer from wim"""
    list_of_unique_dicts = []
    for dict_ in list_of_dicts:
        if dict_ not in list_of_unique_dicts:
            list_of_unique_dicts.append(dict_)
    return list_of_unique_dicts

%timeit remove_dupes(large_list)
1 loop, best of 3: 2min 55s per loop

My approach, using a set, is much faster:

%timeit remove_dict_duplicates(large_list)
1 loop, best of 3: 590 ms per loop
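
If both speed and the original order matter, the two ideas can be combined: keep a set of the packed keys as a "seen" record while appending first occurrences to a list. A sketch (the function name is mine, not from either answer):

def remove_dupes_ordered(list_of_dicts):
    """Order-preserving dedup via a seen-set of hashable keys (sketch)."""
    seen = set()
    result = []
    for elem in list_of_dicts:
        # Same packing as above: a hashable signature of the nested dict.
        key = frozenset((k, frozenset(v.items())) for k, v in elem.items())
        if key not in seen:
            seen.add(key)
            result.append(elem)
    return result

Set membership checks are O(1) on average, so this stays fast on large lists while keeping the first occurrence of each entry in place.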
Mike Müller