Check if elements of list A are in list B of five million entry too slow

Question

The following function was used to check if elements of list A(files_list) are in list B(reference_list). However, the reference_list consisted five million entry, and checking the code take too much time.

def check_files_in_list(file_list, reference_list):
    not_found_files = []
    for file_path in file_list:
        if file_path not in reference_list:
            not_found_files.append(file_path)
    return not_found_files

Is there any way to accelerate this process? A different data structure, etc.? I tried some simple parallelization, but it did not accelerate the process. Had python already implemented the parallelization if file_path not in reference_list by default?

How many elements in A? If A is not small relative to B, then it would likely make sense to convert `reference_list` to a set. Checking the presence of an item in a list is O(N) time complexity, but for a set it's O(1). https://wiki.python.org/moin/TimeComplexity — slothrop, Jun 24 '23 at 22:04
@JohnGordon The item in the list is known to be distinct. I saw the "Worst Case" in slothrop's comment is also "O(n)" — ShoutOutAndCalculate, Jun 24 '23 at 22:34
@slothrop The list A is also around five million elements. Basically it's to build a synchronization state. Does parallelization work on set or if it has already been implemented? — ShoutOutAndCalculate, Jun 24 '23 at 22:38
There's very little parallelization in Python by default. You get some with numpy, in the sense that numpy uses the CPU's multiple-data operations (like SIMD) where possible; otherwise you need threads or multiprocessing in order to parallelize. As @Henrique said, the speedup of a set over a list comes from using a hashtable, which means that item lookup can be done much faster. (It's true that in the pathological worst case where every item has the same hash value, set lookup would be O(N), but for realistic datasets it's going to be very close to O(1).) — slothrop, Jun 25 '23 at 07:56

score 1 · Accepted Answer · answered Jun 24 '23 at 23:49

1

You'll be hard-pressed to find anything better than O(n) if you need to compare between all the values in the list. But your code is O(n²) since the two lists are similar in size, so O(n) isn't that bad.

If you use set as described in the comments, you can perform the intersection like this.

s = {"A", "B", "C", "D", "E"}
t = {"A", "C", "D"}

print(s - t)  # prints {'E', 'B'}

This, according to the docs, gives you O(len(s)). Note that "s" is the largest list

answered Jun 24 '23 at 23:49

Henrique

328
1
2
10

Thank you. Somehow it indeed worked. However, could you explain a bit of why the set was performed faster than the list? I wanted to know if it's possible to use some structure to further accelerate this process. – ShoutOutAndCalculate Jun 24 '23 at 23:51
1

Sets are implemented using hash tables: https://stackoverflow.com/questions/8929284/what-makes-sets-faster-than-lists – Henrique Jun 24 '23 at 23:57

Check if elements of list A are in list B of five million entry too slow

1 Answers1