I need to read a large (10GB+) file line by line and process each line. The processing is fairly simple, so multiprocessing seemed like the way to go. However, when I set it up, it runs much, much slower than running things linearly. My CPU usage never goes above 50%, so it's not a processing-power issue.
I'm running Python 3.6 in Jupyter Notebook on a Mac.
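For context, the linear version I'm comparing against is essentially just the same fake work done in a plain loop, roughly this (simplified; `file_on_my_machine` is a placeholder for the real path):

    # sequential baseline, roughly what "running things linearly" means here
    results = []
    with open(file_on_my_machine, 'rt', newline="\n") as f:
        for line in f:
            results.append(line.split("\t"))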
This is what I have, working from the accepted answer posted here:
    from multiprocessing import Manager, Process

    def do_work(in_queue, out_list):
        while True:
            line = in_queue.get()
            # exit signal
            if line is None:
                return
            # fake work for testing
            elements = line.split("\t")
            out_list.append(elements)

    if __name__ == "__main__":
        num_workers = 4

        manager = Manager()
        results = manager.list()
        work = manager.Queue(num_workers)

        # start the workers
        pool = []
        for i in range(num_workers):
            p = Process(target=do_work, args=(work, results))
            p.start()
            pool.append(p)

        # produce data
        with open(file_on_my_machine, 'rt', newline="\n") as f:
            for line in f:
                work.put(line)

        # send one exit signal per worker, then wait for them to finish
        for _ in range(num_workers):
            work.put(None)
        for p in pool:
            p.join()

        # get the results
        print(sorted(results))
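In case it matters, the speed comparison is nothing more sophisticated than wall-clock timing around each version, along the lines of this sketch (not my exact harness):

    import time

    # rough wall-clock timing around the whole run
    start = time.perf_counter()
    # ... run either the sequential loop or the multiprocessing version here ...
    elapsed = time.perf_counter() - start
    print(f"done in {elapsed:.1f}s")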