
I have a Python script that performs a very simple task on a huge input file (>10M lines). The script boils down to:

import fileinput

for line in fileinput.input(remainder):
    obj = process(line)
    print_nicely(obj)

There is no interaction between the lines. But the output needs to be kept in the same order as the input lines.

My attempt to speed things up with multiprocessing looks like this:

import fileinput
import multiprocessing as mp

p = mp.Pool(processes=4)
it = p.imap(process, fileinput.input(remainder))
for x in it:
    print_nicely(x)
p.close()

It appears to make things slower rather than faster. I assume this is due to the overhead of passing the lines/objects between processes.

Is it possible to speed things up for this use case, or is the overhead of multiprocessing in Python just too high for it?

  • If the output needs to stay in the same order then it doesn't make sense to incorporate multiprocessing in the printing part since the order of output needs to be sequential - the 1st line needs to print before the 2nd line, etc. – michotross Jan 02 '20 at 18:55
  • 2
  • @michotross you can still parallelize the pre-processing in its own for loop and then go through each preprocessed line to print it in another un-parallelized for loop. – Yacine Mahdid Jan 02 '20 at 19:11
  • 1
  • The documentation for `Pool.imap()` says there's an optional `chunksize` argument (which defaults to `1`) that "can make the job complete **much** faster" (emphasis theirs). – martineau Jan 02 '20 at 19:13
  • @YacineMahdid agreed, that's why I specified the **printing part**. – michotross Jan 02 '20 at 19:18
  • How long does it take with and without the multiprocessing? – Stefan Pochmann Jan 02 '20 at 19:21
  • 1
  • @martineau's suggestion of changing the `chunksize` argument should be the first thing you try. If that doesn't help, try [memory mapping the file](https://docs.python.org/3.8/library/mmap.html#mmap.mmap). If it's still too slow, you may be able to use something like [this chunk reading method](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python) to process the file in parallel chunks, and sort the output chunks. – skrrgwasme Jan 02 '20 at 19:38
  • Also, have you profiled this code to see if the bottleneck is the HD accesses or the processing? – skrrgwasme Jan 02 '20 at 19:40
  • I did a full run without multiprocessing, and one with multiprocessing and chunksize=100: the result was 17:21 vs. 28:13, so it was still much slower. – Sec Jan 05 '20 at 18:22
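
For reference, a minimal sketch of the `chunksize` approach discussed in the comments above, assuming `process`, `print_nicely`, and `remainder` from the question; 1000 is only an illustrative batch size, not a tuned value:

import fileinput
import multiprocessing as mp

if __name__ == "__main__":
    with mp.Pool(processes=4) as p:
        # A larger chunksize hands lines to the workers in batches,
        # which amortizes the per-item IPC overhead of imap().
        # imap() still yields results in the original input order.
        for obj in p.imap(process, fileinput.input(remainder), chunksize=1000):
            print_nicely(obj)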

1 Answer

import fileinput
import multiprocessing as mp

import numpy as np

def process(line):
    # do something with a single line...
    return line

def process_data(data):
    # process a whole chunk of lines in one worker call
    return [process(line) for line in data]

if __name__ == "__main__":
    num_processes = 4
    # fileinput.input() returns a lazy iterator, so materialize it into a
    # list to allow len() and slicing (remainder as in the question)
    data = list(fileinput.input(remainder))

    # split the lines into num_processes roughly equal chunks
    indx = np.linspace(0, len(data), num_processes + 1).astype(int)
    data_split = [data[indx[i]:indx[i + 1]] for i in range(num_processes)]

    pool = mp.Pool(processes=num_processes)
    # map() returns the chunk results in the same order as data_split
    processed_data = pool.map(process_data, data_split)
    pool.close()
    pool.join()
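
Since `Pool.map()` returns results in the same order as the chunks in `data_split`, concatenating them preserves the original line order. A possible way to consume them, reusing `print_nicely` from the question:

# flatten the per-chunk results and print them in the original line order
for chunk in processed_data:
    for obj in chunk:
        print_nicely(obj)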