
I was reading a similar thread where the OP wanted to process each line of a file in a function using multiprocessing (found here). The answer that intrigued me was the following:

from multiprocessing import Pool

def process_line(line):
    return "FOO: %s" % line

if __name__ == "__main__":
    pool = Pool(4)
    with open('file.txt') as source_file:
        # chunk the work into batches of 4 lines at a time
        results = pool.map(process_line, source_file, 4)

I'm wondering if you can do the same, but instead of returning each line processed, write it into another file.

Basically, I want to see if there is a way to use multiprocessing for reading and writing a file in order to split it up by lines. Say I want 100,000 lines per file.

from multiprocessing import Pool

def write_lines(line):
    # need a method to write lines to multiple files, perhaps a Queue?
    pass

if __name__ == "__main__":
    # all my procs
    pool = Pool()
    with open('file.txt') as source_file:
        # chunk the work into batches of 100,000 lines at a time
        results = pool.map(write_lines, source_file, 100000)

I could use a multiprocessing Queue to split the file up across separate Queue objects, then give each processor the job of writing out all the lines in its queue, but I still have to read through the whole file first. So will this always be completely I/O bound, and never able to benefit from multiprocessing in an efficient way?
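Here is a rough sketch of that Queue idea, just to show what I have in mind (writer_worker, the one-queue-per-chunk layout, and the output file names are all placeholders, not working code I've settled on):

from multiprocessing import Process, Queue

def writer_worker(queue, filename):
    # drain the queue and write lines until the sentinel (None) arrives
    with open(filename, 'w') as out:
        for line in iter(queue.get, None):
            out.write(line)

if __name__ == "__main__":
    lines_per_file = 100000
    queue = Queue()
    writers = []
    file_index = 0

    with open('file.txt') as source_file:
        for count, line in enumerate(source_file):
            if count % lines_per_file == 0:
                if writers:
                    # close off the previous chunk's writer before starting a new one
                    queue.put(None)
                    queue = Queue()
                writer = Process(target=writer_worker,
                                 args=(queue, 'out_%d.txt' % file_index))
                writer.start()
                writers.append(writer)
                file_index += 1
            queue.put(line)

    queue.put(None)
    for writer in writers:
        writer.join()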

jwillis0720

1 Answer


As you suspected, this workload really won't benefit much (if at all) from multiprocessing. All you're doing here is reading one file and writing its contents out to other files. That is completely I/O bound; the bottleneck is the speed of reading from and writing to disk. Using multiprocessing to write multiple files to the same disk concurrently isn't going to make the writes any faster, because the disk can only write one thing at a time.
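Since the lines just need to be copied, a plain single-process loop is about as fast as this is going to get. Here's a minimal sketch of splitting the file into 100,000-line pieces sequentially (the part_%04d.txt naming and the itertools.islice approach are just one way to do it):

from itertools import islice

def split_file(path, lines_per_file=100000):
    # single process: read the source once and write it out in fixed-size chunks of lines
    with open(path) as source_file:
        file_index = 0
        while True:
            chunk = list(islice(source_file, lines_per_file))
            if not chunk:
                break
            with open('part_%04d.txt' % file_index, 'w') as out:
                out.writelines(chunk)
            file_index += 1

if __name__ == "__main__":
    split_file('file.txt')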

Where multiprocessing can help is if you've got some CPU-bound work that can be parallelized, but that really isn't the case with what you're trying to do. If you wanted to read lines from a file, do some fairly heavy processing of each line, and then write them to some other file, multiprocessing would help, but it doesn't sound like you need to do any processing prior to writing each line.
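If you did have heavy per-line processing, one common pattern is to let the pool do the CPU work while the parent process handles all of the file I/O. A minimal sketch of that pattern (process_line here is just a stand-in for real CPU-heavy work):

from multiprocessing import Pool

def process_line(line):
    # stand-in for a genuinely CPU-heavy transformation of the line
    return line.upper()

if __name__ == "__main__":
    pool = Pool(4)
    with open('file.txt') as source_file, open('out.txt', 'w') as out:
        # imap streams results back in order as the workers finish them,
        # so all of the writing happens in this single parent process
        for result in pool.imap(process_line, source_file, chunksize=1000):
            out.write(result)
    pool.close()
    pool.join()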

dano