I have a program that serially counts the frequency of the lines in a set of files. The files can be in sub-directories, and each file contains a list of Wikipedia categories, one category per line. I would like the frequency count of the categories across all files. For example, a file called Los Angeles.txt might have the following lines in it:
City
Location
I then want a tab-separated file written out with the number of times each category was used, in descending order:
Person 3494
City 2000
Location 1
My current code is:
import os
from collections import defaultdict
from operator import itemgetter

dir = "C:\\Wikipedia\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]

d = defaultdict(int)
for file in l:
    with open(file, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()  # Removes surrounding \n as well as spaces.
            if line != "":
                d[line] += 1

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
    for k2, v2 in sorted(d.items(), key=itemgetter(1), reverse=True):
        f_out.write(k2 + "\t" + str(v2) + "\n")
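As an aside, I realise the serial version can be written more compactly with Counter from collections. This is a sketch of what I have in mind, which should be equivalent to the code above:

import os
from collections import Counter

dir = "C:\\Wikipedia\\Categories"
files = [os.path.join(root, name) for root, _, names in os.walk(dir) for name in names]

counts = Counter()
for path in files:
    with open(path, encoding="utf8") as f_in:
        # update() counts every element of the iterable, here each non-empty stripped line.
        counts.update(line.strip() for line in f_in if line.strip())

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
    for category, count in counts.most_common():  # most_common() is already sorted descending.
        f_out.write(category + "\t" + str(count) + "\n")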
My question is: how can I use the Pool from the multiprocessing module to do this in parallel?
The issues that I'm wondering about are:

- Does the multiprocessing module only do processes, or does it do threads as well, since this is an IO-bound problem?
- Can the Counter functionality from collections be incorporated in some way?
- Does os.walk already execute in a parallel manner?
- Is there some sort of dictionary functionality in multiprocessing, similar to multiprocessing.Value, multiprocessing.Queue and multiprocessing.Array, that I should be using to share the counts between the processes and thereby get an aggregated frequency count at the end? Can you use a normal Python dict with multiprocessing, or will there be a sharing violation and corrupted data?
Can anyone help with a code example?
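To make the question concrete, this is the rough shape I'm imagining from my reading of the multiprocessing docs (a sketch I haven't verified, not working code): each worker counts one whole file into its own private Counter, and the parent process merges the per-file counters, so no dictionary is ever shared between processes.

import os
from collections import Counter
from multiprocessing import Pool

dir = "C:\\Wikipedia\\Categories"

def count_file(path):
    # Runs in a worker process; builds a private Counter, so nothing is shared while counting.
    counts = Counter()
    with open(path, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if line != "":
                counts[line] += 1
    return counts

if __name__ == "__main__":  # Required on Windows, where worker processes are spawned.
    files = [os.path.join(root, name) for root, _, names in os.walk(dir) for name in names]
    total = Counter()
    with Pool() as pool:  # Defaults to one worker per CPU core.
        for counts in pool.imap_unordered(count_file, files):
            total.update(counts)  # Merge in the parent; arrival order doesn't matter.
    with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
        for category, count in total.most_common():
            f_out.write(category + "\t" + str(count) + "\n")

As far as I can tell, multiprocessing.dummy provides the same Pool interface backed by threads, so if this really is IO-bound, only the import would change. Is this merge-in-the-parent structure the right approach, or is there something better?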