
I have a file with several thousand records and a list of regular expressions. I'd like to take each record in the file in turn and evaluate it against my list of regular expressions until a match is found.

I created a single-threaded script and it does the job, but it is very slow. To make it multithreaded I made the following adjustments:

  1. Created the run_target() function that is passed to the Thread constructor
  2. Created 5 worker threads
  3. Added a call to the target function inside the check_file() function.

Question: run_target() takes 2 arguments that I pass to it with each iteration of the while loop in the check_file() function. Do I need to somehow pass the arguments to the constructor when I create worker threads or shall I leave it as default? Or, should I pass keyword arguments with default values?

Also, is there a better or smarter way to tackle this? Thanks in advance.

def run_target(key, expr):
    matchStr = re.search(expr, key, re.I)
    if matchStr:
        return 1
    else:
        return 0


for i in range(number_of_threads):
    worker = Thread(target = run_target(), args = ())
    worker.daemon = True
    t.start()


def check_file():

    for key, value in data.items():
        while True:
            expr = q.get()
            result = run_target(key, expr)
            if result == 1:
                lock.acquire()
                print 'Match found'
                lock.release()
                break
            q.task_done()
        q.join()
zan
    I don't understand this code at all. Your loop will create threads that try to run `run_target`, but they'll all fail since you're passing an empty tuple of arguments. Then `check_file` calls `run_target` itself, completely separate from the threads. Are you wanting to make a thread-pool or something (e.g. `concurrent.futures.ThreadPoolExecutor`)? In any case, I don't expect you'll get any speedup using threads for regular expression matching, since the work is CPU bound and the GIL will prevent any real concurrency. – Blckknght Aug 04 '18 at 19:32

1 Answer


Re your first question: yes, as per the threading library documentation, the function arguments must be passed to the Thread constructor. So instead of `worker = Thread(target = run_target(), args = ())` you need something like `worker = Thread(target = run_target, args = (key, expr))`. Note there are no parentheses after `run_target`: you pass the function object itself, not the result of calling it.
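A minimal sketch of the corrected call, using a made-up record and pattern (a thread's return value is discarded, so the result has to be written somewhere the caller can see it):

```python
import re
from threading import Thread

# Hypothetical sample inputs standing in for one record and one expression.
key = "ERROR: disk full"
expr = r"disk"

results = []

def run_target(key, expr):
    # Append 1 on a match, 0 otherwise; a shared list replaces the
    # return value, which Thread would silently throw away.
    results.append(1 if re.search(expr, key, re.I) else 0)

# Pass the callable itself (no parentheses) plus its arguments as a tuple;
# the thread calls run_target(key, expr) when started.
worker = Thread(target=run_target, args=(key, expr))
worker.start()
worker.join()
```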

The code you have posted does not seem to do what you intend, anyway. In my opinion, a better strategy for your goal is to have a function that takes a regex as an argument and processes the entire file inside that function, and then spawn several threads with `Thread(target = process_file, args = (expr,))` (note the trailing comma after expr, which makes the argument a one-element tuple).
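A rough sketch of that strategy, with made-up records and patterns: one thread per expression, each scanning every record.

```python
import re
from threading import Thread

# Hypothetical stand-ins for the question's records and expression list.
data = {"foobar": "...", "bar42": "..."}
expressions = [r"foo", r"bar\d+"]

matches = []   # shared result list; list.append is thread-safe in CPython

def process_file(expr):
    # One thread handles one pattern and walks the whole record set,
    # compiling the pattern once instead of on every record.
    pattern = re.compile(expr, re.I)
    for key in data:
        if pattern.search(key):
            matches.append((expr, key))

threads = [Thread(target=process_file, args=(expr,)) for expr in expressions]
for t in threads:
    t.start()
for t in threads:
    t.join()
```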

Note that there is a known limitation of threads in CPython, the most popular Python implementation, that makes them ineffective for CPU-bound work on multicore CPUs: the Global Interpreter Lock. See more in this SO answer. If that is the case on your system, then multiprocessing is a good alternative; its high-level API is quite similar.

Happy coding :)

Evgeny Tanhilevich