I need to read a large (10GB+) file line by line and process each line. The processing is fairly simple, so multiprocessing seemed like the way to go. However, when I set it up, it runs much, much slower than running things linearly. My CPU usage never goes above 50%, so it's not a processing-power issue.
I'm running Python 3.6 in Jupyter Notebook on a Mac.
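For context, the linear version I'm comparing against is essentially just the same fake work done in a plain loop, roughly this (simplified; `file_on_my_machine` is a placeholder for the real path):

    # sequential baseline, roughly what "running things linearly" means here
    results = []
    with open(file_on_my_machine, 'rt', newline="\n") as f:
        for line in f:
            results.append(line.split("\t"))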
This is what I have, working from the accepted answer posted here:
    from multiprocessing import Manager, Process

    def do_work(in_queue, out_list):
        while True:
            line = in_queue.get()
            # exit signal
            if line is None:
                return
            # fake work for testing
            elements = line.split("\t")
            out_list.append(elements)

    if __name__ == "__main__":
        num_workers = 4

        manager = Manager()
        results = manager.list()
        work = manager.Queue(num_workers)

        # start the workers
        pool = []
        for i in range(num_workers):
            p = Process(target=do_work, args=(work, results))
            p.start()
            pool.append(p)

        # produce data
        with open(file_on_my_machine, 'rt', newline="\n") as f:
            for line in f:
                work.put(line)

        # send one exit signal per worker, then wait for them to finish
        for _ in range(num_workers):
            work.put(None)
        for p in pool:
            p.join()

        # get the results
        print(sorted(results))
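In case it matters, the speed comparison is nothing more sophisticated than wall-clock timing around each version, along the lines of this sketch (not my exact harness):

    import time

    # rough wall-clock timing around the whole run
    start = time.perf_counter()
    # ... run either the sequential loop or the multiprocessing version here ...
    elapsed = time.perf_counter() - start
    print(f"done in {elapsed:.1f}s")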