I am trying to figure out how to use the multiprocessing library for a task, and I've been experimenting with it to get an idea of how things work.
The task is counting occurrences of specific words in a very large file (a text corpus), and I wrote this basic script to solve the problem on a smaller scale before trying anything bigger.
I start with a nested list of names (4 chunks that are meant to represent a chunked file), use a queue to collect a dictionary of name counts from each chunk, and then merge the results at the end. This is what I have so far:
from multiprocessing import Process, Queue, cpu_count
from timeit import default_timer as timer
import random

names = ["Julie", "Ben", "Rob", "Samantha", "Alice", "Jamie"]

# Nested list of 4 chunks, each containing 1000 randomly selected names
# from <names>, meant to stand in for a chunked file.
file_chunks = [
    [random.choice(names) for _ in range(1000)] for _ in range(4)
]

def count_occurrences_in_part(queue, chunk):
    # Tally the names in this chunk and put the partial counts on the queue.
    name_count = {name: 0 for name in names}
    for name in chunk:
        name_count[name] += 1
    queue.put(name_count)

def main():
    queue = Queue()
    n_cpus = cpu_count()
    print(f"Dividing tasks between {n_cpus} CPUs")
    processes = [
        Process(target=count_occurrences_in_part, args=(queue, chunk))
        for chunk in file_chunks
    ]
    start = timer()
    for p in processes:
        p.start()
    # Drain the queue before joining: a child process cannot exit until its
    # queued data has been consumed, so joining first can deadlock on large results.
    results = [queue.get() for _ in processes]
    for p in processes:
        p.join()
    end = timer()
    print(f"elapsed time: {end - start}")
    final_dict = {name: 0 for name in names}
    for result in results:
        for name, count in result.items():
            final_dict[name] += count
    print(final_dict)

if __name__ == "__main__":
    main()
Here is a sample of the output:
Dividing tasks between 8 CPUs
elapsed time: 0.07983553898520768
{'Julie': 662, 'Ben': 698, 'Rob': 690, 'Samantha': 669, 'Alice': 653, 'Jamie': 628}
However, I am curious whether there is a way to have a dictionary shared by all the processes, so that I can update the counts in place rather than merging everything at the end. I've read that using a queue is the preferred approach, but I don't know whether that kind of sharing is possible with a queue.
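For example, is something along these lines reasonable? This is an untested sketch using multiprocessing.Manager, which I understand can provide a dict shared between processes; it reuses names and file_chunks from the script above, and count_into_shared and main_shared are just names I made up.

from multiprocessing import Process, Manager

def count_into_shared(shared_counts, lock, chunk):
    # Tally locally first; shared_counts[name] += 1 is not atomic across
    # processes, so do one locked update per chunk instead of one per name.
    local = {name: 0 for name in names}
    for name in chunk:
        local[name] += 1
    with lock:
        for name, count in local.items():
            shared_counts[name] += count

def main_shared():
    with Manager() as manager:
        shared_counts = manager.dict({name: 0 for name in names})
        lock = manager.Lock()
        processes = [
            Process(target=count_into_shared, args=(shared_counts, lock, chunk))
            for chunk in file_chunks
        ]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(dict(shared_counts))

if __name__ == "__main__":
    main_shared()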
My other question is whether there is an easy way to merge the dictionaries at the end, adding up the counts for matching keys, without having to loop over every entry myself, in case my final dictionaries are very large and looping would take a long time.
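For example, would collections.Counter be the right tool here? Counters add per-key, so the per-chunk dicts could be summed directly (untested sketch replacing the merge loop above; I assume it still loops internally, so I'm not sure it would actually be faster on very large dictionaries):

from collections import Counter

# Each queue item is a plain dict of counts; Counter addition sums
# the values of matching keys.
results = [Counter(queue.get()) for _ in processes]
final_dict = sum(results, Counter())
print(dict(final_dict))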
Also, I have not used the multiprocessing package much, so any additional suggestions would be much appreciated!
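For instance, I have also been wondering whether multiprocessing.Pool would be a better fit than managing Process objects and a Queue by hand. This untested sketch (reusing file_chunks from the script above; count_chunk and main_pool are names I made up) is what I had in mind:

from collections import Counter
from multiprocessing import Pool, cpu_count

def count_chunk(chunk):
    # Counter(iterable) tallies occurrences of each element in the chunk.
    return Counter(chunk)

def main_pool():
    with Pool(processes=cpu_count()) as pool:
        per_chunk = pool.map(count_chunk, file_chunks)
    print(dict(sum(per_chunk, Counter())))

if __name__ == "__main__":
    main_pool()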