I'm trying to get a handle on multithreading in Python. I have working code that counts the number of words, the number of non-empty lines, and builds a dict with the count of each word. It runs fast on small files like the one noted in the code comments. However, I usually use glob to pull in multiple files, and then the run time increases significantly. Meanwhile, since my script is single-threaded, I can see three cores sitting idle while one maxes out.
I thought I would give Python's threading module a shot. Here's what I have done so far (non-working):
#!/usr/bin/env python
#
# test file: http://www.gutenberg.org/ebooks/2852.txt.utf-8
import fileinput
from collections import defaultdict
import threading
import time

inputfilename = 'pg2852.txt'

line_counter = 0
tot_words = 0
word_dict = defaultdict(int)

def myCounters(threadName, delay):
    # the counters are module-level, so they need a global declaration
    # before += can rebind them (delay is currently unused)
    global tot_words, line_counter
    # NOTE: fileinput keeps module-level state, so two threads calling
    # fileinput.input() at once raise "input() already active" -- and even
    # if they didn't, each thread would count the whole file, doubling
    # every total and racing on the shared counters
    for line in fileinput.input([inputfilename]):
        line = line.strip()
        if not line:
            continue
        words = line.split()
        tot_words += len(words)
        line_counter += 1
        for word in words:
            word_dict[word] += 1
    print "%s: %s:" % (threadName, time.ctime(time.time()))
    print word_dict
    print "Total Words: ", tot_words
    print "Total Lines: ", line_counter

try:
    # start_new_thread lives in the low-level thread module; since
    # threading is imported here, threading.Thread is the right call
    t1 = threading.Thread(target=myCounters, args=("Thread-1", 2))
    t2 = threading.Thread(target=myCounters, args=("Thread-2", 4))
    t1.start()
    t2.start()
except:
    print "Error: Thread Not Started"

# wait for both threads instead of busy-looping on a core
t1.join()
t2.join()
For those of you who try this code: it doesn't work. I assume I need to break the input file into chunks and merge the output somehow. Map/reduce? Or perhaps there's a simpler solution?
Edit:
Maybe something like this (a rough sketch in code follows the list):
- open the file,
- break it into chunks
- feed each chunk to a different thread
- get counts and build dict on each chunk
- merge counts / dict
- return results
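Here is a minimal, untested sketch of that plan. One caveat: on CPython the GIL keeps pure-Python, CPU-bound threads from running in parallel, so the sketch swaps the threading module for a multiprocessing.Pool; the filename, worker count, and chunking scheme are just placeholder assumptions.

#!/usr/bin/env python
# Rough sketch of the chunk / count / merge plan above.
# Assumptions: pg2852.txt exists and fits in memory; 4 workers.
from collections import defaultdict
from multiprocessing import Pool

inputfilename = 'pg2852.txt'
num_workers = 4

def count_chunk(lines):
    # step 4: count words and non-empty lines in one chunk
    counts = defaultdict(int)
    words_in_chunk = 0
    lines_in_chunk = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        words = line.split()
        words_in_chunk += len(words)
        lines_in_chunk += 1
        for word in words:
            counts[word] += 1
    return counts, words_in_chunk, lines_in_chunk

if __name__ == '__main__':
    # steps 1 and 2: open the file and break it into chunks
    with open(inputfilename) as f:
        lines = f.readlines()
    chunk_size = max(1, (len(lines) + num_workers - 1) // num_workers)
    chunks = [lines[i:i + chunk_size]
              for i in range(0, len(lines), chunk_size)]

    # step 3: feed each chunk to a different worker process
    pool = Pool(num_workers)
    results = pool.map(count_chunk, chunks)
    pool.close()
    pool.join()

    # step 5: merge the per-chunk counts and dicts
    word_dict = defaultdict(int)
    tot_words = 0
    line_counter = 0
    for counts, nwords, nlines in results:
        tot_words += nwords
        line_counter += nlines
        for word, n in counts.items():
            word_dict[word] += n

    # step 6: report the merged results
    print "Total Words: ", tot_words
    print "Total Lines: ", line_counter

Since each file is an independent unit of work anyway, an even simpler variant might be to keep the existing single-threaded counter and hand whole files (from glob) to the pool, one per worker, then merge the per-file results the same way.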