
I have two fairly big lists of dictionaries; each is around 1 million dictionaries long. What I want to do is compare both lists and detect whether list1 has any dictionaries that are not present in list2. I am using the following code to achieve this:

def compare_lists(list1, list2):
    new_items = [i for i in list1 if i not in list2]
    return new_items

It works as intended, but the problem is that it is very slow - with lists of this size, the comparison takes over an hour to run.

Is there a way to make it run quicker? I have to compare the full dictionaries, not only certain items, as any key:value pair can differ between the two lists.

Aleister Crowley
    Interested in how you come to have two lists of ~1m dictionaries each. What is the underlying data? Wonder if there is some other way to solve this using the underlying data. – jarmod Aug 23 '22 at 13:52
  • The data for one list is downloaded from my Postgres DB, the other list is a collection of JSON files, downloaded daily. The goal is to check whether the DB is up to date with the downloaded JSON files. Each JSON file is essentially a dict that could differ from an existing row in the DB or not even exist in the DB. – Aleister Crowley Aug 23 '22 at 13:58
  • could you provide us an example of input data for list1 and list2? – Paulo Pereira Aug 23 '22 at 13:58
  • If you don't consider the memory problem, you can try to convert the items of each dictionary into a `frozenset` and add it to the `set` for searching (of course, the cost is still very large for 1 million dictionaries). If you are using Python 3.11, you can consider converting to `types.MappingProxyType`. If the dictionary is hashable, the proxy can also be hashed, and the cost of this conversion is much smaller than that of the `frozenset`. – Mechanic Pig Aug 23 '22 at 14:10
  • @PauloPereira these are just generic dictionaries, like {"name": "Tom Hanks", "address": "Hollywood", "age": 70} in one list and in the other list you'd have similar dictionary, but for example the age could differ. – Aleister Crowley Aug 23 '22 at 14:12
  • You can use the ideas of the answers in this question: https://stackoverflow.com/questions/2703599/what-would-a-frozen-dict-be to make the dictionaries hashable. Comparing hashes (e.g. by using a `set`) will be much easier than comparing `dicts`. – Wups Aug 23 '22 at 14:23
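
A minimal sketch of the frozenset idea suggested in the comments above; the example data is made up, and it assumes every value in the dictionaries is hashable:

def find_new_items(list1, list2):
    # Freeze each dict in list2 into a hashable key (a frozenset of its key/value pairs)
    seen = {frozenset(d.items()) for d in list2}
    # Keep the dicts from list1 whose frozen form was not seen in list2
    return [d for d in list1 if frozenset(d.items()) not in seen]

# Made-up example data
list1 = [{"name": "Tom Hanks", "age": 70}, {"name": "Jane Doe", "age": 30}]
list2 = [{"name": "Tom Hanks", "age": 70}]
print(find_new_items(list1, list2))  # -> [{'name': 'Jane Doe', 'age': 30}]

The answer below takes the related route of serializing each dictionary to a JSON string instead, which also copes with unhashable values such as nested lists.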

1 Answer


Approach

Using the idea from this answer: Convert a list of dictionaries into a set of dictionaries

  • Dictionaries in both lists are serialized to strings (using json)
  • The serialized dictionaries are placed into two sets, one per list
  • New elements are found by taking the set difference (serialized dictionaries in set 1 that are not in set 2)
  • Elements in the set difference are deserialized to get the list of new dictionaries

Code

from json import dumps, loads

def find_difference(lst1, lst2):
    # Returns elements in lst1 which are not in lst2
    set1 = dics_to_set(lst1)
    set2 = dics_to_set(lst2)
    
    # Deserialize elements in set1 that are not in set2
    return [loads(x) for x in set1.difference(set2)]  # items in set1 that are not in set2

def dics_to_set(lst):
    '''
        Convert list of dicts to set
    '''
    return set(dumps(x, sort_keys=True) for x in lst)  # sort_keys to control order of keys
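
For example, with a couple of made-up dictionaries:

print(find_difference(
    [{"name": "Tom Hanks", "age": 70}, {"name": "Jane Doe", "age": 30}],
    [{"name": "Tom Hanks", "age": 70}],
))
# -> [{'name': 'Jane Doe', 'age': 30}]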

Performance

Summary

  • With 100K values, ~260X faster (2.06 s vs. 9 min 3 s)

Test setup:

  • List 2: 100K random dictionaries each with 5 keys
  • List 1: copy of List 2 plus one additional random dictionary

Test Code

from random import randrange

def rand_dicts(n):
    '''
        Create random dictionary of n elements
    '''
    mydict = {}
    for i in range(n):
        mydict[f'key{i}'] = randrange(100)

    return mydict

# List of random dictionaries with 5 elements each
lst2 =  [rand_dicts(5) for _ in range(100000)]

# Copy of list 2 with one more random dictionary added
lst1 = lst2 + [rand_dicts(1)] 

Timing using IPython's %timeit magic (built on the timeit module)

# Test of Posted Code
%timeit [x for x in lst1 if x not in lst2] 
# Output: 9min 3s ± 13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Test of proposed code
%timeit find_difference(lst1, lst2)
# Output: 2.06 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
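
Note that %timeit is an IPython/Jupyter magic. In a plain Python script, a rough equivalent using the standard timeit module could look like the sketch below (the run count of 7 is only illustrative):

import timeit

runs = 7
total = timeit.timeit(lambda: find_difference(lst1, lst2), number=runs)
print(f'{total / runs:.2f} s per call')
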
DarrylG
  • The solution works and the increase in speed is mind-blowing - two lists of dictionaries, close to 800k dicts in length each, compared in just 4 seconds. – Aleister Crowley Aug 23 '22 at 15:29
  • @AleisterCrowley -- Great to hear. I was going to post a time comparison using faked lists but the base method was taking too long on my machine so gave up on it. – DarrylG Aug 23 '22 at 15:38