
I have a Python script that performs a very simple task on a huge input file (>10M lines). The script boils down to:

import fileinput

for line in fileinput.input(remainder):
    obj = process(line)
    print_nicely(obj)

There is no interaction between the lines. But the output needs to be kept in the same order as the input lines.

My attempt to speed things up with multiprocessing looks like this:

import fileinput
import multiprocessing as mp

p = mp.Pool(processes=4)
it = p.imap(process, fileinput.input(remainder))
for x in it:
    print_nicely(x)
p.close()

It appears to make things slower rather than faster. I assume this is due to the overhead of passing the lines/objects between processes.

Is it possible to speed things up for this use case, or is the overhead of multiprocessing in Python just too high for it?

  • If the output needs to stay in the same order then it doesn't make sense to incorporate multiprocessing in the printing part since the order of output needs to be sequential - the 1st line needs to print before the 2nd line, etc. – michotross Jan 02 '20 at 18:55
  • 2
  • @michotross you can still parallelize the pre-processing in its own for loop and then go through each preprocessed line to print it in another un-parallelized for loop. – Yacine Mahdid Jan 02 '20 at 19:11
  • 1
  • The documentation for `Pool.imap()` says there's an optional `chunksize` argument (which defaults to `1`) that "can make the job complete **much** faster" (emphasis theirs). – martineau Jan 02 '20 at 19:13
  • @YacineMahdid agreed, that's why I specified the **printing part**. – michotross Jan 02 '20 at 19:18
  • How long does it take with and without the multiprocessing? – Stefan Pochmann Jan 02 '20 at 19:21
  • 1
  • @martineau's suggestion of changing the `chunksize` argument should be the first thing you try. If that doesn't help, try [memory mapping the file](https://docs.python.org/3.8/library/mmap.html#mmap.mmap). If it's still too slow, you may be able to use something like [this chunk reading method](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python) to process the file in parallel chunks, and sort the output chunks. – skrrgwasme Jan 02 '20 at 19:38
  • Also, have you profiled this code to see if the bottleneck is the HD accesses or the processing? – skrrgwasme Jan 02 '20 at 19:40
  • I did a full run without multiprocessing, and one with multiprocessing and chunksize=100: the result was 17:21 vs. 28:13, so it was still much slower. – Sec Jan 05 '20 at 18:22
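
For reference, a minimal sketch of the `chunksize` approach discussed in the comments above, assuming `process`, `print_nicely`, and `remainder` from the question; 1000 is only an illustrative batch size, not a tuned value:

import fileinput
import multiprocessing as mp

if __name__ == "__main__":
    with mp.Pool(processes=4) as p:
        # A larger chunksize hands lines to the workers in batches,
        # which amortizes the per-item IPC overhead of imap().
        # imap() still yields results in the original input order.
        for obj in p.imap(process, fileinput.input(remainder), chunksize=1000):
            print_nicely(obj)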

1 Answer

import fileinput
import multiprocessing as mp

import numpy as np

def process(line):
    # do something with a single line...
    return line

def process_data(data):
    # process a whole chunk of lines in one worker call
    return [process(line) for line in data]

if __name__ == "__main__":
    num_processes = 4
    # fileinput.input() returns a lazy iterator, so materialize it into a
    # list to allow len() and slicing (remainder as in the question)
    data = list(fileinput.input(remainder))

    # split the lines into num_processes roughly equal chunks
    indx = np.linspace(0, len(data), num_processes + 1).astype(int)
    data_split = [data[indx[i]:indx[i + 1]] for i in range(num_processes)]

    pool = mp.Pool(processes=num_processes)
    # map() returns the chunk results in the same order as data_split
    processed_data = pool.map(process_data, data_split)
    pool.close()
    pool.join()
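
Since `Pool.map()` returns results in the same order as the chunks in `data_split`, concatenating them preserves the original line order. A possible way to consume them, reusing `print_nicely` from the question:

# flatten the per-chunk results and print them in the original line order
for chunk in processed_data:
    for obj in chunk:
        print_nicely(obj)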