I have two files. File A contains 1 million records. File B contains approximately 2,000 strings, each on a separate line.
I have a Python script that takes each string in File B in turn and searches for a match in File A. The logic is as follows (file names are simplified):
import re

out = open("matches.txt", "w")
for string in open("file_b.txt"):
    pattern = re.compile(string.strip())  # I use regex for the match
    for record in open("file_a.txt"):
        if pattern.search(record):
            out.write(record)  # write the matching record to a separate file
This currently runs as a single thread of execution and takes a few hours to complete; that is roughly 2,000 × 1,000,000 = 2 billion regex searches in total.
I'd like to introduce concurrency to speed this script up. What is the best way to approach it? I have looked into multithreading, but my scenario doesn't seem to fit the producer-consumer pattern: my machine has an SSD, so I/O is not the bottleneck. Would multiprocessing help with this?
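For reference, this is a rough sketch of what I had in mind, splitting the strings from File B across worker processes with multiprocessing.Pool (file names are placeholders):

import re
from multiprocessing import Pool

def find_matches(string):
    # Each worker compiles one search string and scans all of File A,
    # returning the matching records.
    pattern = re.compile(string)
    with open("file_a.txt") as fa:
        return [record for record in fa if pattern.search(record)]

if __name__ == "__main__":
    with open("file_b.txt") as fb:
        strings = [line.strip() for line in fb if line.strip()]
    # Workers search in parallel; only the parent writes to the output
    # file, so there is no contention on it.
    with Pool() as pool, open("matches.txt", "w") as out:
        for matches in pool.imap_unordered(find_matches, strings):
            out.writelines(matches)

Is this roughly the right way to structure it, or is there a better approach?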