Speeding up string matching between two large lists

Question

I have two large lists each with thousands of elements as follows.

I want to extract the pairs of elements whose strings match between the two lists.

However, my code is very slow. How can I speed it up?

import os, glob

list1 = glob.glob("/data0/*.txt")
list2 = glob.glob("/data1/*.txt")

with open("result.txt", "w") as fout:
    for i1 in list1:
        tobematched1 = os.path.basename(i1).split(".")[0] + "_" + os.path.basename(i1).split(".")[3]
        for i2 in list2:
            tobematched2 = os.path.basename(i2).split(".")[0] + "_" + os.path.basename(i2).split(".")[3]
            if tobematched1 == tobematched2:
                fout.write(i1 + ";" + i2 + "\n")

This problem is not about common-element comparison as in "Common elements comparison between 2 lists". My question is about matching strings between two lists.

Exactly what results do you want? What should happen, for example, if `list1` contains the same value five times, and `list2` contains that same value five times as well? Should there be one output? Five? Twenty-five (each pairwise combination)? Something else? Does it matter? My experience has been that questions asking about "matching" or "comparing" elements are almost never specific enough. — Karl Knechtel, Mar 22 '23 at 10:55
@KarlKnechtel there are only one versus one matches between the lists. — Sara, Mar 22 '23 at 10:56
So, neither list has any duplicates? I suppose that would make sense, since it seems the data comes from file system listings. — Karl Knechtel, Mar 22 '23 at 10:57
Then this is straightforward, and a very frequently asked question; please see the linked duplicate. — Karl Knechtel, Mar 22 '23 at 11:01
Using a dictionary can significantly speed up the matching step since dictionary lookup is a constant-time operation. In contrast, the nested loop in your code has a time complexity of O(n^2), which can be very slow for large lists. — JCTL, Mar 22 '23 at 11:02
@KarlKnechtel This problem is not about common-element comparison as in https://stackoverflow.com/questions/2864842/common-elements-comparison-between-2-lists. My question is about matching strings between two lists. — Sara, Mar 22 '23 at 11:04
@daniel The linked duplicate is relevant. You'll just need to do the `tobematched` treatment first for each list (making them dicts that map the treated name to the actual name), then apply a set intersection on the dicts' keys. — AKX, Mar 22 '23 at 11:06
What @AKX said. The problem straightforwardly breaks down into a series of steps (transform the input strings into the possibly-duplicate form, then check for the duplicates), and I gave you the canonical for the part that you didn't already show how to do. — Karl Knechtel, Mar 22 '23 at 20:44

score 1 · Accepted Answer · answered Mar 22 '23 at 11:09

To do this fast with set intersection, apply the transformation to each path first (keeping track of the original value), then use the keys the two sets share to look the original paths back up:

import os
import glob


# Maps a pathname to the part we want to compare
def process_name(item: str) -> str:
    basename_bits = os.path.basename(item).split(".")
    return f"{basename_bits[0]}_{basename_bits[3]}"


# Read the filenames and map them using the transformation above
map1 = {process_name(item): item for item in glob.glob("/data0/*.txt")}
map2 = {process_name(item): item for item in glob.glob("/data1/*.txt")}

# Find the common keys and print the original values.
for common_key in map1.keys() & map2.keys():
    print(map1[common_key], map2[common_key])
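
If you want the same `path1;path2` lines as the loop in the question, the approach above could be adapted roughly as follows. This is only a sketch: the `match_pairs` helper and the sample file names are hypothetical, it assumes every file name has at least four dot-separated parts, and it relies on the one-to-one matching mentioned in the comments (a duplicate key in the second list would silently keep only the last path).

```python
import os


def process_name(path: str) -> str:
    """Build the comparison key from the 1st and 4th dot-separated parts."""
    bits = os.path.basename(path).split(".")
    return f"{bits[0]}_{bits[3]}"


def match_pairs(paths1, paths2):
    """Return (path1, path2) pairs whose keys match, in the order of paths1."""
    # Index the second list once: key -> original path.
    lookup = {process_name(p): p for p in paths2}
    pairs = []
    for p in paths1:
        # Average-case constant-time lookup instead of an inner loop.
        other = lookup.get(process_name(p))
        if other is not None:
            pairs.append((p, other))
    return pairs


# Hypothetical file names, for illustration only.
sample1 = ["/data0/a.x.y.001.txt", "/data0/b.x.y.002.txt"]
sample2 = ["/data1/b.q.r.002.txt", "/data1/c.q.r.003.txt"]

pairs = match_pairs(sample1, sample2)
lines = [f"{p1};{p2}" for p1, p2 in pairs]
print("\n".join(lines))
```

With real data you would pass the two `glob.glob(...)` results to `match_pairs` and write each line to `result.txt`. Each list is traversed once, so the work grows with the combined length of the lists rather than with their product.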
