
I am trying to figure out how I might use the multiprocessing library for a task, and I have been experimenting with it to get an idea of how things work. The task I am trying to solve is counting occurrences of specific words in a very large file (a corpus of text), and I wrote this basic script to solve the problem on a smaller scale before trying anything bigger.

I start with a nested list of names (4 chunks meant to represent a chunked file), use a queue to collect a dictionary of name counts from each chunk, and then merge those dictionaries at the end. This is what I have so far:

import multiprocessing
from multiprocessing import Process, Queue, Pool, cpu_count
from timeit import default_timer as timer
import random

names = ["Julie", "Ben", "Rob", "Samantha", "Alice", "Jamie"]
file_chunks = [
    [random.choice(names) for _ in range(1000)] for _ in range(4)
] # nested list of 4 chunks each containing 1000 randomly selected names from the <names> list


def count_occurrences_in_part(queue, chunk):
    name_count = {name:0 for name in names}
    for name in chunk:
        name_count[name] += 1

    queue.put(name_count)
    

def main():
    queue = Queue()
    n_cpus = cpu_count()
    print(f"Dividing tasks between {n_cpus} CPUs")
    
    processes = [Process(target=count_occurrences_in_part, args=(queue, chunk)) for chunk in file_chunks]
    
    start = timer()
    
    for p in processes:
        p.start()
    
    for p in processes:
        p.join()
    
    end = timer()
    print(f'elapsed time: {end - start}')
    
    results = [queue.get() for _ in processes]
    
    final_dict = {name:0 for name in names}
    for result in results:
        for name, count in result.items():
            final_dict[name] += count
            
    print(final_dict)


if __name__ == "__main__":
    main()

Here is a sample output from running the script:

Dividing tasks between 8 CPUs
elapsed time: 0.07983553898520768
{'Julie': 662, 'Ben': 698, 'Rob': 690, 'Samantha': 669, 'Alice': 653, 'Jamie': 628}

However, I am curious whether there is a way to have a dictionary that is shared by all the processes, so that I can update the counts in it directly rather than having to merge everything at the end. I have read that using a queue is the preferred approach, but I don't know whether that kind of shared state is possible with a queue.
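
For concreteness, here is a rough sketch of the kind of shared dictionary I have in mind, using `multiprocessing.Manager` (the function name `count_into_shared` is just illustrative, and each worker tallies locally and then folds its totals into the shared dict under a lock, since per-item updates on a proxy dict are not atomic across processes). I have not benchmarked this against the queue version.

import random
from multiprocessing import Process, Manager

names = ["Julie", "Ben", "Rob", "Samantha", "Alice", "Jamie"]
file_chunks = [[random.choice(names) for _ in range(1000)] for _ in range(4)]


def count_into_shared(shared_counts, lock, chunk):
    # Tally locally first, then fold the totals into the shared dict under the lock.
    local = {name: 0 for name in names}
    for name in chunk:
        local[name] += 1
    with lock:
        for name, count in local.items():
            shared_counts[name] = shared_counts.get(name, 0) + count


def main():
    with Manager() as manager:
        shared_counts = manager.dict()
        lock = manager.Lock()
        processes = [Process(target=count_into_shared, args=(shared_counts, lock, chunk))
                     for chunk in file_chunks]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print(dict(shared_counts))


if __name__ == "__main__":
    main()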

My other question is whether there is an easy way to merge the dictionaries at the end, adding up the counts for matching keys, without writing an explicit loop, in case my final dictionaries are very large and looping over them would take a long time.
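
For example, is something along these lines with `collections.Counter` a reasonable way to express the merge? (I realize it still iterates internally; the `results` list below is just a stand-in for the per-chunk dictionaries pulled off the queue.)

from collections import Counter

# Stand-in for the per-chunk dictionaries pulled off the queue.
results = [
    {"Julie": 3, "Ben": 1},
    {"Julie": 2, "Ben": 4, "Rob": 1},
]

final_counts = sum((Counter(r) for r in results), Counter())
print(final_counts)  # Julie: 5, Ben: 5, Rob: 1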

Also, I have not used the multiprocessing package much, so any additional suggestions would be much appreciated!

djvaroli
  • Normally processes don't share memory, and you would have to use the [multiprocessing.shared_memory](https://docs.python.org/3/library/multiprocessing.shared_memory.html) module. But it would probably be more complex than your current code. – furas Mar 01 '21 at 19:51
  • Maybe you should use the specialized dictionary `collections.Counter` instead of a normal dictionary. – furas Mar 01 '21 at 19:55
  • You import `Pool` but you don't use it – with `Pool` it could be simpler, since it wouldn't need `join()` or a `queue` (see the sketch after these comments). – furas Mar 01 '21 at 19:57
  • In my opinion, you should create `file_chunks` inside `main()`, because the current code creates a new `file_chunks` in every process separately, which you don't need. – furas Mar 01 '21 at 19:59
  • You can use a `multiprocessing.Manager` to create a shared dictionary between child processes, but I wouldn't really call it a "better" solution than what you're doing now. Due to the implementation, managers aren't very fast at all, so it would be a bit slower and more complicated than what you're already doing. The fastest solution would be to share a numpy array and assign a numeric index to each name. [Here's](https://stackoverflow.com/a/66424666/3220135) a recent answer of mine on how to share numpy arrays with `shared_memory`. – Aaron Mar 01 '21 at 20:05
  • Right. "Merge at the end" is really the best way with Python. Overall, this is a task I would choose to do in C++ with multithreading. I love Python, but it sucks at computationally-intensive multitasking. – Tim Roberts Mar 01 '21 at 20:22
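
A rough sketch of the `Pool` plus `collections.Counter` approach suggested in the comments above – the worker function name `count_chunk` is illustrative, not from the original code:

import random
from collections import Counter
from multiprocessing import Pool, cpu_count

names = ["Julie", "Ben", "Rob", "Samantha", "Alice", "Jamie"]


def count_chunk(chunk):
    # Each worker returns a Counter for its chunk; no queue or explicit join needed.
    return Counter(chunk)


def main():
    file_chunks = [[random.choice(names) for _ in range(1000)] for _ in range(4)]
    with Pool(processes=min(cpu_count(), len(file_chunks))) as pool:
        partial_counts = pool.map(count_chunk, file_chunks)
    final_counts = sum(partial_counts, Counter())
    print(dict(final_counts))


if __name__ == "__main__":
    main()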

0 Answers