I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regexes are for grep, as our current implementation is basically a bash find\grep combo. The bash script takes 358 seconds on my benchmark directory. I was able to write a Python script that did this in 72 seconds but want to improve more. First I will post the base code and then the tweaks I have tried:
import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):
    def __init__(self, queue, dir):
        self.queue = queue
        self.dir = dir
        threading.Thread.__init__(self)

    def run(self):
        self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self, rootDir):
        for root, subFolders, files in os.walk(rootDir):
            for file in files:
                self.queue.put(os.path.join(root, file))
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
class Scanner(threading.Thread):
    def __init__(self, queue, patterns):
        self.queue = queue
        self.patterns = patterns
        threading.Thread.__init__(self)

    def run(self):
        nextFile = self.queue.get()
        while nextFile != -1:
            #print "Trying " + nextFile
            self.scanFile(nextFile)
            nextFile = self.queue.get()

    #HELPER FUNCTION FOR INTERNAL USE ONLY
    def scanFile(self, file):
        fp = open(file)
        contents = fp.read()
        fp.close()
        #for patt in self.patterns:
        if self.patterns.search(contents):
            print "Match found in " + file
############MAIN MAIN MAIN MAIN##################
fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
    #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
    giantRE = giantRE + line.rstrip() + '|'
giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue, rootDir)
recurser.start()
print "starting scanner"

#start checking the files
for scanner in xrange(0, 8):
    scanner = Scanner(fileQueue, giantRE)
    scanner.start()
This is obviously debugging\ugly code; do not mind the million queue.put(-1) calls, I will clean that up later.
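For what it's worth, here is a hedged sketch (not the original script) of how the pile of queue.put(-1) calls could be replaced: put exactly one sentinel per scanner thread, and use None rather than -1, since it cannot collide with real data and identity comparison with `is None` is reliable. The module is spelled `queue` on Python 3 and `Queue` on 2.x, as the fallback import shows:

```python
# Sketch: one None sentinel per worker replaces the repeated queue.put(-1).
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def scanner(q, results):
    while True:
        item = q.get()
        if item is None:        # sentinel: shut this worker down
            break
        results.append(item)    # placeholder for the real scanFile(item)

def run(paths, n_workers=4):
    q = queue.Queue()
    results = []
    threads = [threading.Thread(target=scanner, args=(q, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for p in paths:
        q.put(p)
    for _ in threads:
        q.put(None)             # exactly one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Appending to a shared list is safe here because list.append is atomic under the GIL.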
Anyway, some things I've noticed: using 1, 4, or even 8 threads (for scanner in xrange(0,???):) makes no difference. I still get ~72 seconds regardless. I assume this is due to Python's GIL.
As opposed to making a giant regex, I tried placing each line (pattern) as a compiled RE in a list and iterating through this list in my scanFile function. This resulted in longer execution time.
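One subtlety when gluing the lines together with '|': without grouping, a pattern like ^foo|bar$ changes meaning once its alternation merges with its neighbours'. A hedged sketch of a safer build step, wrapping each line in a non-capturing group before joining:

```python
# Sketch: combine one-pattern-per-line input into a single alternation,
# isolating each line in (?:...) so anchors and '|' inside one pattern
# cannot bleed into the next.
import re

def build_giant_re(lines):
    parts = ['(?:%s)' % ln.strip() for ln in lines if ln.strip()]
    return re.compile('|'.join(parts), re.IGNORECASE)
```

This keeps the single-compiled-regex speed advantage while matching what each grep pattern line meant on its own.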
In an effort to avoid python's GIL I tried having each thread fork to grep as in:
#HELPER FUNCTION FOR INTERNAL USE ONLY
def scanFile(self, file):
    s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout=subprocess.PIPE)
    output = s.communicate()[0]
    if output != '':
        print 'Match found in ' + file
This resulted in longer execution time.
Any suggestions on improving performance?
:::::::::::::EDIT::::::::
I cannot post answers to my own question yet; however, here are the answers to several points raised:
@David Nehme - Just to let people know I am aware of the fact that I have a million queue.put(-1)'s
@Blender - To mark the bottom of the queue. My scanner threads keep dequeuing until they hit -1, which is at the bottom (while nextFile != -1:). The machine has 8 processor cores; however, due to the GIL, using 1 thread, 4 threads, or 8 threads does NOT make a difference. Spawning 8 subprocesses resulted in significantly slower code (142 sec vs 72).
@ed - Yes, that, and it's just as slow as the find\grep combo, actually slower, because it indiscriminately greps files that aren't needed.
@Ron - Can't upgrade, this must be universal. Do you think this will speed it up past 72 seconds? The bash grepper does 358 seconds. My Python giant-RE method does 72 seconds with 1-8 threads. The popen method with 8 threads (8 subprocesses) ran at 142 seconds. So far the giant-RE Python-only method is the clear winner by far.
@intuted
Here's the meat of our current find\grep combo (Not my script). It's pretty simple. There are some additional things in there like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient 5x is a HUGE slowdown.
find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"
The python code is more efficient, I don't know why, but I experimentally tested it. I prefer to do this in python. I already achieved a speedup of 5x with python, I would like to get it sped up more.
:::::::::::::WINNER WINNER WINNER:::::::::::::::::
Looks like we have a winner.
intuted's shell script comes in 2nd place with 34 seconds; however, @steveha's came in first with 24 seconds. Due to the fact that a lot of our boxes do not have Python 2.6, I had to cx_freeze it. I can write a shell script wrapper to wget a tar and unpack it. I do like intuted's for its simplicity, however.
Thank you for all your help, guys. I now have an efficient tool for sysadmining.