
I have two fairly big lists of dictionaries; each is around 1 million dictionaries long. What I want to do is compare both lists and detect whether list1 has any dictionaries that are not present in list2. I am using the following code to achieve this:

def compare_lists(list1, list2):
    new_items = [i for i in list1 if i not in list2]
    return new_items

It works as intended, but the problem is that it is very slow - with lists of this size, the comparison takes over an hour to run.

Is there a way to make it run quicker? I have to compare the full dictionaries, not only certain items, as any key:value pair can differ between the two lists.

Aleister Crowley
    Interested in how you come to have two lists of ~1m dictionaries each. What is the underlying data? Wonder if there is some other way to solve this using the underlying data. – jarmod Aug 23 '22 at 13:52
  • The data for one list is downloaded from my Postgres DB, the other list is a collection of JSON files, downloaded daily. The goal is to check whether the DB is up to date with the downloaded JSON files. Each JSON file is essentially a dict that could differ from an existing row in the DB or not even exist in the DB. – Aleister Crowley Aug 23 '22 at 13:58
  • could you provide us an example of input data for list1 and list2? – Paulo Pereira Aug 23 '22 at 13:58
  • If you don't consider the memory problem, you can try to convert the items of each dictionary into a `frozenset` and add it to the `set` for searching (of course, the cost is still very large for 1 million dictionaries). If you are using Python 3.11, you can consider converting to `types.MappingProxyType`. If the dictionary is hashable, the proxy can also be hashed, and the cost of this conversion is much smaller than that of the `frozenset`. – Mechanic Pig Aug 23 '22 at 14:10
  • @PauloPereira these are just generic dictionaries, like {"name": "Tom Hanks", "address": "Hollywood", "age": 70} in one list and in the other list you'd have similar dictionary, but for example the age could differ. – Aleister Crowley Aug 23 '22 at 14:12
  • You can use the ideas of the answers in this question: https://stackoverflow.com/questions/2703599/what-would-a-frozen-dict-be to make the dictionaries hashable. Comparing hashes (e.g. by using a `set`) will be much easier than comparing `dicts`. – Wups Aug 23 '22 at 14:23
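
A minimal sketch of the frozenset idea suggested in the comments above; the example data is made up, and it assumes every value in the dictionaries is hashable:

def find_new_items(list1, list2):
    # Freeze each dict in list2 into a hashable key (a frozenset of its key/value pairs)
    seen = {frozenset(d.items()) for d in list2}
    # Keep the dicts from list1 whose frozen form was not seen in list2
    return [d for d in list1 if frozenset(d.items()) not in seen]

# Made-up example data
list1 = [{"name": "Tom Hanks", "age": 70}, {"name": "Jane Doe", "age": 30}]
list2 = [{"name": "Tom Hanks", "age": 70}]
print(find_new_items(list1, list2))  # -> [{'name': 'Jane Doe', 'age': 30}]

The answer below takes the related route of serializing each dictionary to a JSON string instead, which also copes with unhashable values such as nested lists.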

1 Answer


Approach

Using the idea from this answer: Convert a list of dictionaries into a set of dictionaries

  • Dictionaries in both lists are serialized to strings (using json)
  • The serialized dictionaries are placed into two sets, one per list
  • New elements are found by taking the set difference (serialized dictionaries in set 1 that are not in set 2)
  • Elements in the set difference are deserialized to get the list of new dictionaries

Code

from json import dumps, loads

def find_difference(lst1, lst2):
    # Returns elements in lst1 which are not in lst2
    set1 = dics_to_set(lst1)
    set2 = dics_to_set(lst2)
    
    # Deserialize elements in set1 that are not in set2
    return [loads(x) for x in set1.difference(set2)]  # items in set1 that are not in set2

def dics_to_set(lst):
    '''
        Convert list of dicts to set
    '''
    return set(dumps(x, sort_keys=True) for x in lst)  # sort_keys to control order of keys
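
For example, with a couple of made-up dictionaries:

print(find_difference(
    [{"name": "Tom Hanks", "age": 70}, {"name": "Jane Doe", "age": 30}],
    [{"name": "Tom Hanks", "age": 70}],
))
# -> [{'name': 'Jane Doe', 'age': 30}]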

Performance

Summary

  • With 100K values, ~260X faster (2.06 s vs. 9 min 3 s)

Test setup:

  • List 2: 100K random dictionaries each with 5 keys
  • List 1: copy of List 2 plus one additional random dictionary

Test Code

from random import randrange

def rand_dicts(n):
    '''
        Create random dictionary of n elements
    '''
    mydict = {}
    for i in range(n):
        mydict[f'key{i}'] = randrange(100)

    return mydict

# List of random dictionaries with 5 elements each
lst2 =  [rand_dicts(5) for _ in range(100000)]

# Copy of list 2 with one more random dictionary added
lst1 = lst2 + [rand_dicts(1)] 

Timing using IPython's %timeit magic (built on the timeit module)

# Test of Posted Code
%timeit [x for x in lst1 if x not in lst2] 
# Output: 9min 3s ± 13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Test of proposed code
%timeit find_difference(lst1, lst2)
# Output: 2.06 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
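
Note that %timeit is an IPython/Jupyter magic. In a plain Python script, a rough equivalent using the standard timeit module could look like the sketch below (the run count of 7 is only illustrative):

import timeit

runs = 7
total = timeit.timeit(lambda: find_difference(lst1, lst2), number=runs)
print(f'{total / runs:.2f} s per call')
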
DarrylG
  • The solution works and the increase in speed is mind-blowing - two lists of dictionaries, close to 800k dicts in length each, compared in just 4 seconds. – Aleister Crowley Aug 23 '22 at 15:29
  • @AleisterCrowley -- Great to hear. I was going to post a time comparison using faked lists but the base method was taking too long on my machine so gave up on it. – DarrylG Aug 23 '22 at 15:38