Parallel file matching, Python

Question

I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regex are for grep as our current implementation is basically a bash script find\grep combo. The bash script takes 358 seconds on my benchmark directory. I was able to write a python script that did this in 72 seconds but want to improve more. First I will post the base-code and then tweaks I have tried:

import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):

    def __init__(self, queue, dir):
    self.queue = queue
    self.dir = dir
    threading.Thread.__init__(self)

    def run(self):
    self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self,  rootDir):
      for root, subFolders, files in os.walk(rootDir):
    for file in files:
       self.queue.put(os.path.join(root,file))
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)

class Scanner(threading.Thread):

    def __init__(self, queue, patterns):
    self.queue = queue
    self.patterns = patterns
    threading.Thread.__init__(self)

    def run(self):
    nextFile = self.queue.get()
    while nextFile is not -1:
       #print "Trying " + nextFile
       self.scanFile(nextFile)
       nextFile = self.queue.get()


    #HELPER FUNCTION FOR INTERNAL UES ONLY
    def scanFile(self, file):
       fp = open(file)
       contents = fp.read()
       i=0
       #for patt in self.patterns:
       if self.patterns.search(contents):
      print "Match " + str(i) + " found in " + file

############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################


fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
   #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
   giantRE = giantRE + line.rstrip() + '|'

giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue,rootDir)
recurser.start()

print "starting scanner"
#start checking the files
for scanner in xrange(0,8):
   scanner = Scanner(fileQueue, giantRE)
   scanner.start()

This is obviously debugging\ugly code, do not mind the million queue.put(-1), I will clean this up later. Some indentations are not showing up properly, paticularly in scanFile.

Anyway some things I've noticed. Using 1, 4, and even 8 threads (for scanner in xrange(0,???):) does not make a difference. I still get ~72 seconds regardless. I assume this is due to python's GIL.

As opposed to making a giant regex I tried placing each line (pattern) as a compilex RE in a list and iterating through this list in my scanfile function. This resulted in longer execution time.

In an effort to avoid python's GIL I tried having each thread fork to grep as in:

#HELPER FUNCTION FOR INTERNAL UES ONLY
def scanFile(self, file):
      s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout = subprocess.PIPE)
      output = s.communicate()[0]
      if output != '':
         print 'Matchfound in ' + file

This resulted in longer execution time.

Any suggestions on improving performance.

:::::::::::::EDIT::::::::

I can not post answers to my own questions yet however here are the answers to several points raised:

@David Nehme - Just to let people know I am aware of the fact that I have a million queue.put(-1)'s

@Blender - To mark the bottom of the queue. My scanner threads keep dequeing until they hit -1 which is at the bottom (while nextFile is not -1:). The processor cores is 8 however due to the GIL using 1 thread, 4 threads, or 8 threads does NOT make a difference. Spawning 8 subprocesses resulted in significantly slower code (142 sec vs 72)

@ed - Yes that and it's just as slow as the find\grep combo, actually slower because it indiscriminately greps file that aren't needed

@Ron - Can't upgrade, this must be universal. Do you think this will speed up > 72 seconds? The bash grepper does 358 seconds. My python giant RE method does 72 seconds w\ 1-8 threads. The popen method w\ 8 thrads (8 subprocesses) ran at 142 seconds. So far the giant RE python only method is a clear winner by far

@intuted

Here's the meat of our current find\grep combo (Not my script). It's pretty simple. There are some additional things in there like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient 5x is a HUGE slowdown.

 find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"

The python code is more efficient, I don't know why, but I experimentally tested it. I prefer to do this in python. I already achieved a speedup of 5x with python, I would like to get it sped up more.

:::::::::::::WINNER WINNER WINNER:::::::::::::::::

Looks like we have a winner.

intued's shell script comes in 2nd place with 34 seconds however @steveha's came in first with 24 seconds. Due to the fact that a lot of our boxes do not have python2.6 I had to cx_freeze it. I can write a shell script wrapper to wget a tar and unpack it. I do like intued's for simplicity however.

Thanks you for all your help guys, I now have an efficient tool for sysadmining

First, what's the point of posting code that you don't want readers to mind? — David Nehme, Oct 01 '11 at 22:04
What is the purpose of `self.queue.put(-1)`? Also, the number of threads should be the number of cores/threads the processor supports. You might have a singlecore. — Blender, Oct 01 '11 at 22:04
Have you tried just a single recursive grep command with the whole lot as a long regex? It's often a good idea to leave things like file handling to programs like grep that are already optimised. — ed., Oct 01 '11 at 22:22
ed's suggestion, combining all of the regexes, means that grep (or whatever) can merge the regexes and do some common work across all of them; you could see a large speedup merging the regexes in this way over matching each separately! — SingleNegationElimination, Oct 02 '11 at 00:59
Doesn't --file=?? treat it as a giant regex? For my python script I already DO use a giant regex, it's clear in my code. That is probably one reason why my python script takes 72 seconds — user974896, Oct 02 '11 at 03:36

score 5 · Answer 1 · answered Oct 01 '11 at 22:59

I'm a bit confused as to how your Python script ended up being faster than your find/grep combo. If you want to use grep in a way somewhat similar to what's suggested by Ron Smith in his answer, you can do something like

find -type f | xargs -d \\n -P 8 -n 100 grep --file=/root/patterns

to launch grep processes which will process 100 files before exiting, keeping up to 8 such processes active at any one time. Having them process 100 files should make the process startup overhead time of each one negligible.

note: The -d \\n option to xargs is a GNU extension which won't work on all POSIX-ish systems. It specifies that the *d*elimiter between filenames is a newline. Although technically filenames can contain newlines, in practice nobody does this and keeps their jobs. For compatibility with non-GNU xargs you need to add the -print0 option to find and use -0 instead of -d \\n with xargs. This will arrange for the null byte \0 (hex 0x00) to be used as the delimiter both by find and xargs.

You could also take the approach of first counting the number of files to be grepped

NUMFILES="$(find -type f | wc -l)";

and then using that number to get an even split among the 8 processes (assuming bash as shell)

find -type f | xargs -d \\n -P 8 -n $(($NUMFILES / 8 + 1)) grep --file=/root/patterns

I think this might work better because the disk I/O of find won't be interfering with the disk I/O of the various greps. I suppose it depends in part on how large the files are, and whether they are stored contiguously — with small files, the disk will be seeking a lot anyway, so it won't matter as much. Note also that, especially if you have a decent amount of RAM, subsequent runs of such a command will be faster because some of the files will be saved in your memory cache.

Of course, you can parameterize the 8 to make it easier to experiment with different numbers of concurrent processes.

As ed. mentions in the comments, it's quite possible that the performance of this approach will still be less impressive than that of a single-process grep -r. I guess it depends on the relative speed of your disk [array], the number of processors in your system, etc.

Here's the meat of our current find\grep combo (Not my script). It's pretty simple. There are some additional things in there like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient 5x is a HUGE slowdown. find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt" — user974896, Oct 02 '11 at 00:42
Like you, I am a fan of Python. I don't want to talk you out of using it. But @intuited has shown you how to drastically speed up the `find` solution: use `xargs` to build long command lines for `grep` to look through. He is starting 8 `grep` processes, each of which will handle 100 files at a time; you are starting a single `grep` process for each file, plus a `tee` process frequently. The overhead of starting up and shutting down all these processes must certainly be the reason you found Python to be faster. — steveha, Oct 02 '11 at 06:15

steveha · Accepted Answer · 2011-10-02T08:35:01.637

I think that, rather than using the threading module, you should be using the multiprocessing module for your Python solution. Python threads can run afoul of the GIL; the GIL is not a problem if you simply have multiple Python processes going.

I think that for what you are doing a pool of worker processes is just what you want. By default, the pool will default to one process for each core in your system processor. Just call the .map() method with a list of filenames to check and the function that does the checking.

http://docs.python.org/library/multiprocessing.html

If this is not faster than your threading implementation, then I don't think the GIL is your problem.

EDIT: Okay, I'm adding a working Python program. This uses a pool of worker processes to open each file and search for the pattern in each. When a worker finds a filename that matches, it simply prints it (to standard output) so you can redirect the output of this script into a file and you have your list of files.

EDIT: I think this is a slightly easier to read version, easier to understand.

I timed this, searching through the files in /usr/include on my computer. It completes the search in about half a second. Using find piped through xargs to run as few grep processes as possible, it takes about 0.05 seconds, about a 10x speedup. But I hate the baroque weird language you must use to get find to work properly, and I like the Python version. And perhaps on really big directories the disparity would be smaller, as part of the half-second for Python must have been startup time. And maybe half a second is fast enough for most purposes!

import multiprocessing as mp
import os
import re
import sys

from stat import S_ISREG


# uncomment these if you really want a hard-coded $HOME/patterns file
#home = os.environ.get('HOME')
#patterns_file = os.path.join(home, 'patterns')

target = sys.argv[1]
size_limit = int(sys.argv[2])
assert size_limit >= 0
patterns_file = sys.argv[3]


# build s_pat as string like:  (?:foo|bar|baz)
# This will match any of the sub-patterns foo, bar, or baz
# but the '?:' means Python won't bother to build a "match group".
with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

# pre-compile pattern for speed
pat = re.compile(s_pat)


def walk_files(topdir):
    """yield up full pathname for each file in tree under topdir"""
    for dirpath, dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            yield pathname

def files_to_search(topdir):
    """yield up full pathname for only files we want to search"""
    for fname in walk_files(topdir):
        try:
            # if it is a regular file and big enough, we want to search it
            sr = os.stat(fname)
            if S_ISREG(sr.st_mode) and sr.st_size >= size_limit:
                yield fname
        except OSError:
            pass

def worker_search_fn(fname):
    with open(fname, 'rt') as f:
        # read one line at a time from file
        for line in f:
            if re.search(pat, line):
                # found a match! print filename to stdout
                print(fname)
                # stop reading file; just return
                return

mp.Pool().map(worker_search_fn, files_to_search(target))

You're a boss!! Thanks. Your script ran the fastest however I do have the 2.6 compatibility issue. After some modifications I CX_Freezed it and will write a bash wrapper to download the tar and call it. — user974896, Oct 02 '11 at 16:18
I'm glad it worked for you! If you want it to run on 2.4 or 2.5, you could try this: http://pypi.python.org/pypi/multiprocessing/ — steveha, Oct 02 '11 at 20:15

score 1 · Answer 3 · answered Feb 07 '19 at 00:03

Let me also show you how to do this in Ray, which is an open-source framework for writing parallel Python applications. The advantage of this approach is that it is fast, easy to write and extend (say you want to pass a lot of data between the tasks or do some stateful accumulation), and can also be run on a cluster or the cloud without modifications. It's also very efficient at utilizing all cores on a single machine (even for very large machines like 100 cores) and data transfer between tasks.

import os
import ray
import re

ray.init()

patterns_file = os.path.expanduser("~/patterns")
topdir = os.path.expanduser("~/folder")

with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

regex = re.compile(s_pat)

@ray.remote
def match(pattern, fname):
    results = []
    with open(fname, 'rt') as f:
        for line in f:
            if re.search(pattern, line):
                results.append(fname)
    return results

results = []
for dirpath, dirnames, filenames in os.walk(topdir):
    for fname in filenames:
        pathname = os.path.join(dirpath, fname)
        results.append(match.remote(regex, pathname))

print("matched files", ray.get(results))

More information including how to run this on a cluster or the cloud is available in the documentatation

score 1 · Answer 4 · answered Oct 01 '11 at 22:27

If you are willing to upgrade to version 3.2 or better, you can take advantage of the concurrent.futures.ProcessPoolExecutor. I think it will improve performance over the popen method you attempted because it will pre-create a pool of processes where your popen method creates a new process every time. You could write your own code to do the same thing for an earlier version if you can't move to 3.2 for some reason.

Parallel file matching, Python

4 Answers4

Linked