I have a program that serially counts the frequency of the lines in a set of files. The files can be in sub-directories, and each file contains a list of Wikipedia categories, one category per line. I would like the frequency count of the categories across all files. For example, a file called Los Angeles.txt might have the following lines in it:
City
Location
I then want a tab-separated file written out with the number of times each category was used, in descending order:
Person 3494
City 2000
Location 1
My current code is:
import os
from collections import defaultdict
from operator import itemgetter

dir = "C:\\Wikipedia\\Categories"
l = [os.path.join(root, name) for root, _, files in os.walk(dir) for name in files]

d = defaultdict(int)
for file in l:
    with open(file, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()  # Removes surrounding \n as well as spaces.
            if line != "":
                d[line] += 1

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
    for k2, v2 in sorted(d.items(), key=itemgetter(1), reverse=True):
        f_out.write(k2 + "\t" + str(v2) + "\n")
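As an aside, I realise the serial version can be written more compactly with Counter from collections. This is a sketch of what I have in mind, which should be equivalent to the code above:

import os
from collections import Counter

dir = "C:\\Wikipedia\\Categories"
files = [os.path.join(root, name) for root, _, names in os.walk(dir) for name in names]

counts = Counter()
for path in files:
    with open(path, encoding="utf8") as f_in:
        # update() counts every element of the iterable, here each non-empty stripped line.
        counts.update(line.strip() for line in f_in if line.strip())

with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
    for category, count in counts.most_common():  # most_common() is already sorted descending.
        f_out.write(category + "\t" + str(count) + "\n")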
My question is: how can I use the Pool from the multiprocessing module to do this in parallel?
The issues that I'm wondering about are:

- Does the multiprocessing module only do processes, or does it do threads as well, since this is an IO-bound problem?
- Can the Counter functionality from collections be incorporated in some way?
- Does os.walk already execute in a parallel manner?
- Is there some sort of dictionary functionality in multiprocessing, similar to multiprocessing.Value, multiprocessing.Queue and multiprocessing.Array, that I should be using to share the counts between the processes and thereby get an aggregated frequency count at the end? Can you use a normal Python dict with multiprocessing, or will there be a sharing violation and corrupted data?
Can anyone help with a code example?
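To make the question concrete, this is the rough shape I'm imagining from my reading of the multiprocessing docs (a sketch I haven't verified, not working code): each worker counts one whole file into its own private Counter, and the parent process merges the per-file counters, so no dictionary is ever shared between processes.

import os
from collections import Counter
from multiprocessing import Pool

dir = "C:\\Wikipedia\\Categories"

def count_file(path):
    # Runs in a worker process; builds a private Counter, so nothing is shared while counting.
    counts = Counter()
    with open(path, encoding="utf8") as f_in:
        for line in f_in:
            line = line.strip()
            if line != "":
                counts[line] += 1
    return counts

if __name__ == "__main__":  # Required on Windows, where worker processes are spawned.
    files = [os.path.join(root, name) for root, _, names in os.walk(dir) for name in names]
    total = Counter()
    with Pool() as pool:  # Defaults to one worker per CPU core.
        for counts in pool.imap_unordered(count_file, files):
            total.update(counts)  # Merge in the parent; arrival order doesn't matter.
    with open("C:\\Wikipedia\\category_counts.tsv", mode="w", encoding="utf8") as f_out:
        for category, count in total.most_common():
            f_out.write(category + "\t" + str(count) + "\n")

As far as I can tell, multiprocessing.dummy provides the same Pool interface backed by threads, so if this really is IO-bound, only the import would change. Is this merge-in-the-parent structure the right approach, or is there something better?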