0

I have two files. File A contains 1 million records. File B contains approximately 2,000 strings, each on a separate line.

I have a Python script that takes each string in File B in turn and searches for a match in File A. The logic is as follows:

import re

with open("file_b") as fb, open("matches", "w") as out:   # placeholder file names
    for string in fb:
        with open("file_a") as fa:
            for record in fa:
                if re.search(string.strip(), record):   # I use regex for this
                    out.write(record)

This is currently running as a single thread of execution and takes a few hours to complete.

I’d like to implement concurrency to speed up this script. What is the best way to approach it? I have looked into multi-threading, but my scenario doesn’t seem to fit the producer-consumer pattern, since my machine has an SSD and I/O is not the bottleneck. Would multiprocessing help with this?

zan
  • 355
  • 6
  • 16
  • Maybe use a DB instead of files? DBs are made exactly for that purpose – kmaork Jan 20 '18 at 20:39
  • I'm not sure concurrency will improve this. Try changing the algorithm first. What about keeping all the file B strings in memory (read one time) and reverse the loops? – Keith Jan 20 '18 at 23:57
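
A minimal sketch of what Keith's comment suggests: read File B once, fold the strings into a single pattern, and make just one pass over File A. This assumes the File B entries are plain substrings (so they can be escaped), and the file names are placeholders:

import re

# Read the ~2,000 search strings once and combine them into one alternation
# pattern, so File A only has to be scanned a single time.
with open("file_b") as fb:
    strings = [line.strip() for line in fb if line.strip()]
pattern = re.compile("|".join(re.escape(s) for s in strings))

with open("file_a") as fa, open("matches", "w") as out:
    for record in fa:                     # single pass over the 1,000,000 records
        if pattern.search(record):
            out.write(record)

Either way, the key change is that File A is read once instead of roughly 2,000 times.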

2 Answers

0

Running such a problem with multiple threads poses a couple of challenges:

  • We have to go over all of the records in File A in order to complete the job.
  • We have to synchronize the writes to the output file, so the threads don't overwrite or interleave each other's records.

I'd suggest:

  1. Dedicate a single thread just to writing the output - so your output file won't get messed up.
  2. Start as many worker threads as your machine can support (n), and give each of them a different chunk of 1,000,000/n records to work on (sketched below).
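
A rough sketch of that layout, assuming File A has already been read into memory: one dedicated writer thread drains a queue while n worker threads each scan their own slice of the records against every File B pattern. The file names and the thread count are placeholders, not part of the answer:

import re
import threading
import queue

with open("file_a") as fa:                  # placeholder file names
    records = fa.readlines()
with open("file_b") as fb:
    patterns = [re.compile(line.strip()) for line in fb if line.strip()]

results = queue.Queue()
SENTINEL = None

def writer():
    # The single "printing" thread: it alone touches the output file,
    # so no further synchronization of the writes is needed.
    with open("matches", "w") as out:
        while True:
            record = results.get()
            if record is SENTINEL:
                break
            out.write(record)

def worker(chunk):
    # Each worker checks its own slice of the records against every pattern
    # and hands matches to the writer; a record is queued once even if
    # several File B strings match it.
    for record in chunk:
        if any(p.search(record) for p in patterns):
            results.put(record)

n = 8                                       # assumed worker count
chunk_size = (len(records) + n - 1) // n
workers = [threading.Thread(target=worker,
                            args=(records[i * chunk_size:(i + 1) * chunk_size],))
           for i in range(n)]

writer_thread = threading.Thread(target=writer)
writer_thread.start()
for t in workers:
    t.start()
for t in workers:
    t.join()
results.put(SENTINEL)                       # all workers done: let the writer finish
writer_thread.join()

If CPython's GIL turns out to limit how much of the regex work the threads can run in parallel, the same chunk-per-worker split carries over directly to multiprocessing, which the question already asks about.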
GalAbra
  • 5,048
  • 4
  • 23
  • 42
  • And make sure to use an OS that will cache the files (e.g. Linux). That'll eliminate repeated HDD/SSD transactions. SSDs are fast, but they're nowhere near as fast as your machine's RAM. With enough RAM to spare, there's a good chance that all of A (and, less importantly, B) would effectively be held in RAM. – bazza Jan 20 '18 at 21:07
  • 1
    @GalAbra - I'll give that a try and will let you know how it goes – zan Jan 20 '18 at 21:44
  • @bazza - good point. I have a beefy machine that can comfortably handle 1GB files in memory. I'll read the file into mem first, and then start the processing. – zan Jan 20 '18 at 21:44
  • 1
    @zan, you shouldn't have to read them in yourself; the operating system's own file system drivers should work out that the files are worth caching. Same result, less code! Good luck – bazza Jan 20 '18 at 22:39
0

The processing you want to do requires checking whether any of the 2_000 strings is in each of the 1_000_000 records, which amounts to 2_000_000_000 such "checks" in total. There's no way around that. Your current logic with the nested for loops simply iterates over all the possible combinations of entries in the two files, one by one, and does the checking (and the output-file writing).

You need to determine a way (if any) that this could be done concurrently. For example, you could have "N" tasks, each checking for one string in each of the million records. The outputs from all these tasks together represent the desired result and would likely need to be aggregated into a single file. Since the results will come back in a fairly arbitrary order, you may also want to sort them.
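
One way that split could look, sketched here with multiprocessing.Pool: each task takes a single File B string, checks it against an in-memory copy of File A, and the parent process aggregates all the per-task matches into one output file. The file names, loading File A fully into memory, and the pool setup are assumptions for illustration:

import re
from multiprocessing import Pool

def init(recs):
    # Give each worker process its own reference to the File A records.
    global records
    records = recs

def find_matches(string):
    # One task: check a single File B string against every record.
    pattern = re.compile(string)
    return [record for record in records if pattern.search(record)]

if __name__ == "__main__":
    with open("file_a") as fa:              # placeholder file names
        file_a_records = fa.readlines()
    with open("file_b") as fb:
        strings = [line.strip() for line in fb if line.strip()]

    with Pool(initializer=init, initargs=(file_a_records,)) as pool:
        per_string_matches = pool.map(find_matches, strings)

    # Aggregate the outputs of all the tasks into a single file; sort here
    # if the final order matters.
    with open("matches", "w") as out:
        for matches in per_string_matches:
            out.writelines(matches)

Splitting by string keeps each task independent, at the cost of every worker process needing access to all of the records.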

martineau
  • 119,623
  • 25
  • 170
  • 301