I have two large lists, each with thousands of elements, built as shown below.

I want to extract pairs of elements by matching strings between the two lists.

However, my code is very slow. How can I speed it up?

import os, glob

list1 = glob.glob("/data0/*.txt")
list2 = glob.glob("/data1/*.txt")

with open("result.txt", "w") as fout:
    for i1 in list1:
        # Key: first and fourth dot-separated parts of the file name
        tobematched1 = os.path.basename(i1).split(".")[0] + "_" + os.path.basename(i1).split(".")[3]
        for i2 in list2:
            tobematched2 = os.path.basename(i2).split(".")[0] + "_" + os.path.basename(i2).split(".")[3]
            if tobematched1 == tobematched2:
                fout.write(i1 + ";" + i2 + "\n")

This problem is not about comparing common elements, as in "Common elements comparison between 2 lists". My question is about matching transformed strings between two lists.

Sara
  • Exactly what results do you want? What should happen, for example, if `list1` contains the same value five times, and `list2` contains that same value five times as well? Should there be one output? Five? Twenty-five (each pairwise combination)? Something else? Does it matter? My experience has been that questions asking about "matching" or "comparing" elements are almost never specific enough. – Karl Knechtel Mar 22 '23 at 10:55
  • @KarlKnechtel there are only one versus one matches between the lists. – Sara Mar 22 '23 at 10:56
  • So, neither list has any duplicates? I suppose that would make sense, since it seems the data comes from file system listings. – Karl Knechtel Mar 22 '23 at 10:57
  • @KarlKnechtel yes, neither list has duplicates – Sara Mar 22 '23 at 10:58
  • Then this is straightforward, and a very frequently asked question; please see the linked duplicate. – Karl Knechtel Mar 22 '23 at 11:01
  • Using a dictionary can significantly speed up the matching step since dictionary lookup is a constant-time operation. In contrast, the nested loop in your code has a time complexity of O(n^2), which can be very slow for large lists. – JCTL Mar 22 '23 at 11:02
  • @KarlKnechtel #This problem is not about common elements comparison as in the https://stackoverflow.com/questions/2864842/common-elements-comparison-between-2-lists My question is to deal with strings between two lists. – Sara Mar 22 '23 at 11:04
  • @daniel The linked duplicate is relevant. You'll just need to do the `tobematched` treatment first for each list (making them dicts that map the treated name to the actual name), then apply a set intersection on the dicts' keys. – AKX Mar 22 '23 at 11:06
  • What @AKX said. The problem straightforwardly breaks down into a series of steps - transform the input strings into the possibly-duplicate form, and then check for the duplicates - and I gave you the canonical for the part that you didn't already show how to do. – Karl Knechtel Mar 22 '23 at 20:44
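
The dictionary approach suggested in the comments above could look roughly like the sketch below. It is an illustration only, not code from the thread: `match_key` and `match_pairs` are hypothetical names, the sample paths are made up, and it assumes (as confirmed above) that neither list produces duplicate keys and that every file name has at least four dot-separated parts.

```python
import os


def match_key(path: str) -> str:
    # Same transformation as in the question: parts 0 and 3 of the basename.
    bits = os.path.basename(path).split(".")
    return f"{bits[0]}_{bits[3]}"


def match_pairs(list1, list2):
    # Build the lookup table once, in a single pass over list2.
    lookup = {match_key(p): p for p in list2}
    pairs = []
    for p in list1:
        # One dictionary lookup per element replaces the inner loop.
        other = lookup.get(match_key(p))
        if other is not None:
            pairs.append((p, other))
    return pairs


# Made-up sample paths, for illustration only.
sample1 = ["/data0/a.x.y.001.txt", "/data0/b.x.y.002.txt"]
sample2 = ["/data1/b.q.r.002.txt", "/data1/c.q.r.003.txt"]
pairs = match_pairs(sample1, sample2)
print(pairs)
```

With n and m elements in the two lists, this does roughly n + m key computations instead of the n * m comparisons of the nested loop.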

1 Answer

To do this quickly with a set intersection, apply the transformation to each path while keeping track of the original value, then look up the keys that both lists have in common:

import os
import glob


# Maps a pathname to the part we want to compare
def process_name(item: str) -> str:
    basename_bits = os.path.basename(item).split(".")
    return f"{basename_bits[0]}_{basename_bits[3]}"


# Read the filenames and map them using the transformation above
map1 = {process_name(item): item for item in glob.glob("/data0/*.txt")}
map2 = {process_name(item): item for item in glob.glob("/data1/*.txt")}

# Find the common keys and print the original values.
for common_key in set(map1).intersection(set(map2)):
    print(map1[common_key], map2[common_key])
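
If the output should follow the question's `result.txt` format rather than being printed, the loop can write `path1;path2` lines instead. The sketch below is a variant under stated assumptions, not part of the original answer: `write_matches` is a hypothetical helper, the demo maps are made up, and sorting the common keys is an added choice to make the output order reproducible, since set iteration order is arbitrary.

```python
import os
import tempfile


def write_matches(map1: dict, map2: dict, out_path: str) -> int:
    """Write one "path1;path2" line per key present in both maps."""
    # Dictionary key views support set operations directly.
    common = sorted(map1.keys() & map2.keys())
    with open(out_path, "w") as fout:
        for key in common:
            fout.write(f"{map1[key]};{map2[key]}\n")
    return len(common)


# Made-up maps standing in for map1 and map2 above.
demo1 = {"a_001": "/data0/a.x.y.001.txt", "b_002": "/data0/b.x.y.002.txt"}
demo2 = {"b_002": "/data1/b.q.r.002.txt", "c_003": "/data1/c.q.r.003.txt"}
demo_path = os.path.join(tempfile.mkdtemp(), "result.txt")
written = write_matches(demo1, demo2, demo_path)
```

The return value is the number of matched pairs, which is handy as a quick sanity check against the expected one-to-one matches.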
AKX