
I have a small bit of code to list file names that match a filter string. I'm trying to extend this to match against multiple filters. I have working code that takes a very straightforward loop approach, but it is slow: it basically runs os.walk once for every filter.

Given the function shown below, is there a way to test against multiple filters at once, rather than one at a time? i.e. can I pass a list of filter strings to find_files?

import os
import fnmatch

# stolen from http://stackoverflow.com/questions/8625991/use-python-os-walk-to-identify-a-list-of-files
def find_files(dir_look, filt):
    matches = []
    for root, dirnames, filenames in os.walk(dir_look):
        for filename in fnmatch.filter(filenames, filt):
            matches.append(os.path.join(root, filename))
    return matches

#create empty list to store results
filelist = []

#some example filters, my real data has about 5000 filters
filts = [r'*60830007*',r'*60910259*',r'*60910299*']

#find files for each filter entry
for filt in filts:  # avoid shadowing the built-in filter()
    filelist.append(find_files(r'C:\some directory', filt))

EDIT:

I found a rather obvious way to speed things up: pass the whole list of filters to the function and test each one inside the os.walk loop, so the tree is only walked once.

def find_files(dir_look, filters):
    matches = []
    for root, dirnames, filenames in os.walk(dir_look):
        for filt in filters:
            for filename in fnmatch.filter(filenames, filt):
                matches.append(os.path.join(root, filename))
    return matches
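A further speedup (not in the original post) is to collapse all the filters into a single compiled regular expression, so each filename is tested exactly once no matter how many filters there are. A minimal sketch, assuming the filters are plain fnmatch-style patterns like those above:

```python
import fnmatch
import os
import re

def find_files(dir_look, filters):
    # Translate each fnmatch pattern into its regex equivalent and join
    # them with '|', so one match() call tests a filename against every
    # filter at once.
    combined = re.compile("|".join(fnmatch.translate(f) for f in filters))
    matches = []
    for root, dirnames, filenames in os.walk(dir_look):
        for filename in filenames:
            if combined.match(filename):
                matches.append(os.path.join(root, filename))
    return matches
```

With thousands of filters this moves the per-filter loop out of Python and into the regex engine; whether it beats the loop version in practice depends on the patterns, so it is worth timing on real data.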
jprockbelly
    Right now os.walk is executed once for each filter, which is slow. Maybe better to read the tree once, and then just run the filters on it? – Marcin Dec 11 '14 at 00:16

2 Answers


This answer will be about algorithms and data structures, instead of python programming.

  1. If you want to test a lot of patterns against a string, then you should choose a better structure for representing it. Instead of a plain character array, use suffix trees. (For Python implementations, see this question.)

  2. If some of your filters have common parts (especially if they share the same prefix), you should represent them as a trie. That way you can test against more than one pattern simultaneously. This solution adds the overhead of building the tree(s), but if you use the same filters multiple times, it's worth it.
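As a concrete illustration of point 2 (not part of the original answer, and the names here are made up for the sketch): a minimal trie holding the numeric parts of the filters, scanned from every start position of a filename. This is effectively multi-pattern substring search, i.e. Aho-Corasick without the failure links:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.terminal = False  # marks the end of an inserted pattern

def insert(root, pattern):
    # Walk/extend the trie one character at a time.
    node = root
    for ch in pattern:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def matches_any(root, text):
    # Try to walk the trie starting at every position in text; reaching
    # a terminal node means one of the inserted patterns occurs in text.
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if ch not in node.children:
                break
            node = node.children[ch]
            if node.terminal:
                return True
    return False
```

Shared prefixes among the patterns are stored only once, which is where the speedup over testing each filter independently comes from.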

  • Thanks Adam, I hadn't seen suffix-trees before but will definitely use them in future. For this problem I want some specific numbers, but not the numbers in between i.e. I want 60830007 and 60910259 but not the 100,000 numbers in between. Is it still worth building a suffix tree? – jprockbelly Dec 11 '14 at 05:06
  • Yes, (I think) it's worth it. If you build the suffix tree for file names of length n in O(n)-O(n log n) time, then matching will be much faster, because you only need to find a branch in the tree that contains the filter instead of going through almost every letter. Even better running time can be achieved if your filters contain only numbers, because then you only need to build a (much smaller) suffix tree from the numeric parts of a file name. – Ádám Nagy Dec 11 '14 at 19:33

Check out glob2. It is fast and robust.
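If a third-party dependency isn't wanted, note that Python 3.5+ supports recursive globbing in the standard library; a sketch (one filter at a time, function name is illustrative):

```python
import glob
import os

def find_files_glob(dir_look, filt):
    # With recursive=True, '**' matches zero or more directory levels,
    # so files in dir_look itself and in all subdirectories are found.
    return glob.glob(os.path.join(dir_look, "**", filt), recursive=True)
```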

nagordon