
I'm trying to get a handle on multithreading in Python. I have working code that calculates the number of words, the number of lines with text, and creates a dict with the count of each word. It runs fast on small files like the one noted in the code comments, but I usually use glob to pull in multiple files, and when I do the run times increase significantly. Meanwhile, since my script is single-threaded, I can see three other cores sitting idle while one maxes out.

I thought I would give Python's threading module a shot; here's what I have done so far (non-working):

#!/bin/python
#
# test file: http://www.gutenberg.org/ebooks/2852.txt.utf-8

import fileinput
from collections import defaultdict
import threading
import time

inputfilename = 'pg2852.txt'

exitFlag = 0

line = []
line_counter = 0
tot_words = 0
word_dict = defaultdict(int)

def myCounters( threadName, delay):
        for line in fileinput.input([inputfilename]):
                line = line.strip();
                if not line: continue
                words = line.split()
                tot_words += len(words)
                line_counter += 1
                for word in words:
                        word_dict[word] += 1

        print "%s: %s:" %( threadName, time.ctime(time.time()) )
        print word_dict
        print "Total Words: ", tot_words
        print "Total Lines: ", line_counter

try:
        thread.start_new_thread( myCounters, ("Thread-1", 2, ) )
        thread.start_new_thread( myCounters, ("Thread-2", 4, ) )
except:
        print "Error: Thread Not Started"

while 1:
        pass

For those of you who try this code: it doesn't work. I assume that I need to break the input file into chunks and merge the output somehow. Map/reduce? Or perhaps there's a simpler solution?

Edit:

Maybe something like this (rough sketch after the list):

  1. open the file,
  2. break it into chunks
  3. feed each chunk to a different thread
  4. get counts and build dict on each chunk
  5. merge counts / dict
  6. return results
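
Here is a rough, untested sketch of what I mean for those steps (the helper name, the number of chunks, and the hard-coded filename are just placeholders):

import threading
from collections import Counter

def count_chunk(lines, results, index):
    # steps 3-4: each thread counts words and non-blank lines in its chunk
    counts = Counter()
    nonblank = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts.update(line.split())
        nonblank += 1
    results[index] = (nonblank, counts)

# steps 1-2: read the file and deal its lines out into N chunks
with open('pg2852.txt') as f:
    all_lines = f.readlines()
num_threads = 4
chunks = [all_lines[i::num_threads] for i in range(num_threads)]

# step 3: start one thread per chunk and wait for them all to finish
results = [None] * num_threads
threads = [threading.Thread(target=count_chunk, args=(chunk, results, i))
           for i, chunk in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# steps 5-6: merge the per-chunk counts and dicts
word_dict = Counter()
total_lines = 0
for nonblank, counts in results:
    total_lines += nonblank
    word_dict.update(counts)
print "Total Words: ", sum(word_dict.values())
print "Total Lines: ", total_lines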
secumind
  • Since your process is I/O-bound, not CPU-bound, running in parallel threads might have the adverse effect of increasing the elapsed time because of additional head movement. Parallel makes sense if the data is distributed on different disks, controllers etc. where it can actually be accessed in parallel because the hardware allows it. – uselpa Jun 03 '12 at 08:06
  • @uselpa I agree, and my data is actually on different disks, a 5-disk striped array. Given that, and based on your comment, I assume I should look into Python's multiprocessing module? – secumind Jun 03 '12 at 08:17
  • What's with the `while 1: pass`? That's guaranteed to peg your CPU AND slow down processing of the text file. You should be using `Thread.join` if you wish the main thread to wait for the other threads to finish. – Dunes Jun 03 '12 at 10:45
  • You should also use the `threading` module rather than `thread` as it provides better functionality. – Dunes Jun 03 '12 at 10:50
  • Why `#!/bin/bash` in your Python script? – yasar Jun 03 '12 at 11:18
  • +1 for very well framed question. – Yavar Jun 03 '12 at 11:28
  • @secumind I believe threads will do fine, although processes will not hurt. As Todd Owen says, 1 thread or process per file. I'd look into a worker thread architecture, see http://stackoverflow.com/questions/9217367/what-is-a-good-practice-design-to-thread-mulitple-sql-queries-in-python/9218304#9218304 – uselpa Jun 03 '12 at 14:48

1 Answer


First of all, you are correct that you need to use parallel processes rather than parallel threads. Doing this kind of task [see ETA below] will not scale well across multiple threads in Python, due to the Global Interpreter Lock (GIL).

If you wanted to process a single file in parallel, the obvious way would be to first check the file size, then assign equal-sized chunks to multiple processes. That would just involve telling each process at what position in the file to start and at what position to end. (Of course, you would have to be careful not to count any words or lines twice. A simple approach would be to have each process ignore the initial bytes until it gets to the start of a line, and then start counting.)
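
For illustration, here is a minimal, untested sketch of one such worker (the function name and the convention that a worker owns every line beginning at an offset up to and including its end offset are my own choices):

def count_range(path, start, end):
    # Count non-blank lines and words in the byte range [start, end] of path.
    # A line belongs to this worker only if it *begins* at an offset <= end;
    # the line straddling the start offset belongs to the previous worker.
    line_count = 0
    word_count = 0
    with open(path, 'rb') as f:
        f.seek(start)
        if start != 0:
            f.readline()        # discard the partial line at the boundary
        while True:
            pos = f.tell()
            if pos > end:
                break           # the next line belongs to the next worker
            line = f.readline()
            if not line:
                break           # end of file
            line = line.strip()
            if not line:
                continue
            word_count += len(line.split())
            line_count += 1
    return line_count, word_count

The parent process would compute the offsets with `os.path.getsize()`, hand each worker a contiguous `(start, end)` pair where each end equals the next worker's start, and then sum the tuples it gets back.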

However, you state in your question that you will be using a glob to process multiple files. So instead of taking the complex route of chunking files and assigning the chunks to different processes, an easier option is simply assigning different files to different processes.
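
For example, a rough sketch (untested; the glob pattern and helper name are mine, and `multiprocessing.Pool` is just one convenient way to farm the files out and merge the results):

import glob
from collections import Counter
from multiprocessing import Pool

def count_file(path):
    # count non-blank lines and word occurrences in a single file
    counts = Counter()
    nonblank = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            counts.update(line.split())
            nonblank += 1
    return nonblank, counts

if __name__ == '__main__':
    paths = glob.glob('*.txt')       # placeholder pattern
    pool = Pool()                    # defaults to one worker per core
    results = pool.map(count_file, paths)
    pool.close()
    pool.join()

    word_dict = Counter()
    total_lines = 0
    for nonblank, counts in results:
        total_lines += nonblank
        word_dict.update(counts)
    print "Total Words: ", sum(word_dict.values())
    print "Total Lines: ", total_lines

Each worker returns its own counts, so no locking is needed; the parent simply merges the per-file results at the end.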


ETA:

Using threads in Python is suitable for certain use cases, such as calling I/O functions that block for a long time. @uselpa is right that if processing is I/O-bound then threads may perform well, but that is not the case here because the bottleneck is actually the parsing, not the file I/O. This is due to the performance characteristics of Python as an interpreted language; in a compiled language, the I/O is more likely to be the bottleneck.

I make these claims because I have just done some measuring based on the original code (using a test file containing 100 concatenated copies of pg2852.txt):

  • Running as a single thread took about 2.6s to read and parse the file, but only 0.2s when I commented out the parsing code.
  • Running two threads in parallel (reading from the same file) took 7.2s, but two single-threaded processes launched in parallel took only 3.3s to both complete.
Todd Owen
  • +1 for different files to different processes. `make -j` or even `xargs -P` makes it trivial to process multiple files in parallel. – jfs Jun 03 '12 at 13:09
  • I don't agree; with an I/O-bound process, threads will do just fine and processes are not necessary. I do agree with the 1-thread-per-file approach. – uselpa Jun 03 '12 at 14:46
  • @uselpa: I have done some tests to confirm my hunch; see above. – Todd Owen Jun 04 '12 at 00:48
  • alright I'm convinced, I'll try it from a different files to different processes approach. @J.F.Sebastian I'll try make -j / xargs -P and see where I get in the next 24 hours. I'll report back here when I'm done. – secumind Jun 04 '12 at 02:42
  • @Todd +1, reality always wins ;-) – uselpa Jun 04 '12 at 15:26