Speed up file matching based on names of files

Question

I have 2 directories with 2 different file types (e.g. .csv, .png) but with the same basename (e.g. 1001_12_15.csv, 1001_12_15.png). I have many thousands of files in each directory.

What I want to do is match the files by basename, get the full paths of both matching files, and then DO something with those full paths.

I am asking for help with speeding up this procedure.

My approach is:

import os

csvList=[a list with the full path of each .csv file]
pngList=[a list with the full path of each .png file]

for i in range(0, len(csvList)):
    csv_base = os.path.basename(csvList[i])
    # e.g. 1001
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    for j in range(0, len(pngList)):
        png_base = os.path.basename(pngList[j])
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # DO SOMETHING

Moreover, I tried fnmatch, something like:

import fnmatch

for csv_file in csvList:
    csv_base = os.path.basename(csv_file)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    rel_path = "/path/to/file"
    pattern = "*" + csv_id + "*.png"

    # keep only the png paths that contain the csv ID
    reg_match = fnmatch.filter(pngList, pattern)
    reg_match = " ".join(str(x) for x in reg_match)
    if reg_match:
        pass  # DO something

It seems that using the nested for loops is faster, but I want it to be even faster. Are there any other approaches I could use to speed up my code?

list comprehension is normally much cleaner and faster than loops — Piakkaa, Jul 21 '19 at 11:19
You are needlessly searching for an exact match. Why don’t you just check whether the png for a given csv actually exists? If you want to do this in-memory, why not use a set for O(1) lookup? — MisterMiyagi, Jul 21 '19 at 12:18
Your description mentions that files match the entire name but the extension. Your code only compares the start of the basename before an underscore. What kind of matches are you looking for? — MisterMiyagi, Jul 21 '19 at 12:32
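
The set-based existence check suggested in the comments above could look roughly like the sketch below. It assumes that all PNG files sit directly in one directory and that a match means the same filename apart from the extension. The names `find_pairs` and `png_dir` are hypothetical and only introduced for illustration.

```python
import os

def find_pairs(csv_paths, png_paths, png_dir):
    """Return (csv_path, png_path) pairs whose filenames differ only in extension.

    Assumption: every PNG lives directly in png_dir (hypothetical parameter).
    """
    png_set = set(png_paths)  # membership tests on a set are O(1) on average
    pairs = []
    for csv_path in csv_paths:
        stem = os.path.splitext(os.path.basename(csv_path))[0]
        # Build the path the matching PNG would have, then test membership
        candidate = os.path.join(png_dir, stem + ".png")
        if candidate in png_set:
            pairs.append((csv_path, candidate))
    return pairs
```

Each CSV costs one set lookup, so the whole pass is linear in the number of files instead of comparing every CSV against every PNG.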

Piakkaa · Answer 1 · 2019-07-21T12:40:25.857

First of all, simplify the syntax of your existing loop like this:

for csv in csvlist:
    csv_base = os.path.basename(csv)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]

    for png in pnglist:
        png_base = os.path.basename(png)
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # do something here

Nested loops are very slow because the body of the inner png loop runs n^2 times, once for every csv/png combination.

Then you can use list comprehensions and list.index to speed it up more:

## create lists of processed values
## so you don't have to keep calling the os library
csv_base_list=[os.path.basename(csv) for csv in csvlist]
csv_id_list=[os.path.splitext(csv_base)[0].split("_")[0] for csv_base in csv_base_list]
png_base_list=[os.path.basename(png) for png in pnglist]
png_id_list=[os.path.splitext(png_base)[0].split("_")[0] for png_base in png_base_list]


## run a single loop with list.index to find each matching pair and record the base values

csv_png_base=[(csv_base_list[csv_id_list.index(png_id)], png_base)
              for png_id,png_base in zip(png_id_list,png_base_list)
              if png_id in csv_id_list]

## csv_png_base contains tuples of the form (csv_base,png_base)

Using list.index replaces the explicit inner loop, and each basename and ID is computed only once, so there are no repeated os library calls.
A list comprehension is slightly faster than a normal loop.

You can then loop through the list and do something with the values, e.g.:

for csv_base,png_base in csv_png_base:
    pass  # do something

pandas will do the job much faster, though, because it runs the loop in a C library.

In my experience, it is faster to write a csv file parser in Python which returns a dict which you can then convert to pandas df. Did you ever try that? — FObersteiner, Jul 21 '19 at 12:18
List comprehensions are comparable in performance to for loops. Moreover, they are significantly more readable than the blob of code of multiple list comprehensions. — MisterMiyagi, Jul 21 '19 at 12:19
well just trying to stick with his methods. plus array index will save a lot of loop counts wouldn't it — Piakkaa, Jul 21 '19 at 12:22
list.index runs an internal loop to find the item. It is still O(n). — MisterMiyagi, Jul 21 '19 at 13:00

score 0 · Answer 2 · answered Jul 21 '19 at 12:58

You can build up a search index in O(n), then seek items in it in O(1) each. If you have exact matches as your question implies, a flat lookup dict suffices:

from os.path import basename, splitext

png_lookup = {
    splitext(basename(png_path))[0] : png_path
    for png_path in pngList
}

This allows you to directly look up the png file corresponding to each csv file:

for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    try:
        png_file = png_lookup[csv_id]
    except KeyError:
        pass  # no matching png for this csv
    else:
        pass  # do something with csv_file and png_file

In the end, you have an O(n) lookup construction and a separate O(n) iteration with a nested O(1) lookup. The total complexity is O(n) compared to your initial O(n^2).
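
If the match should only consider the ID before the first underscore, as the code in the question does, several files can share one key and a flat dict would keep only the last of them. One possible sketch groups the PNG paths per ID instead. The names `file_id` and `match_by_id` are illustrative, and comparing the IDs as strings rather than through float() is an assumption.

```python
from collections import defaultdict
from os.path import basename, splitext

def file_id(path):
    """Return the part of the filename before the first underscore."""
    return splitext(basename(path))[0].split("_")[0]

def match_by_id(csv_paths, png_paths):
    """Yield (csv_path, png_path) for every pair that shares the same ID."""
    # Build the index once: ID -> list of all PNG paths with that ID
    png_index = defaultdict(list)
    for png_path in png_paths:
        png_index[file_id(png_path)].append(png_path)
    for csv_path in csv_paths:
        # .get avoids creating empty entries for IDs that have no PNG
        for png_path in png_index.get(file_id(csv_path), ()):
            yield csv_path, png_path
```

Building the index is one pass over the PNG list, and each CSV then needs a single dict lookup, so the cost grows with the number of files plus the number of matching pairs.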
