
I'm currently trying to use threads to count every word in a file in parallel, but at the moment my code gets slower when I add even just one extra thread. I'd expect the time to decrease as threads increase until I saturate my CPU, after which times should get slower again. I don't understand why it isn't behaving in parallel.

Here is the code:

import threading
import time
import sys

class CountWords(threading.Thread):
    def __init__(self, lock, pair):  # pair is a (dict, word_list) tuple
        threading.Thread.__init__(self)
        self.lock = lock
        self.words = pair[1]
        self.dit = pair[0]

    def run(self):
        for word in self.words:
            #self.lock.acquire()
            if word in self.dit:  # "in" on a dict checks its keys directly
                self.dit[word] = self.dit[word] + 1
            else:
                self.dit[word] = 1
            #self.lock.release()


def getWordsFromFile(numThreads, fileName):
    lists = []
    for i in range(int(numThreads)):
        lists.append([])
    print len(lists)
    infile = open(fileName, "r")
    # uses .read().splitlines() instead of .readlines() to get rid of "\n"s
    all_words = map(lambda l: l.split(" "), infile.read().splitlines())
    infile.close()
    all_words = make1d(all_words)
    # deal the words round-robin into one list per thread
    cur = 0
    for word in all_words:
        lists[cur].append(word)
        if cur == len(lists) - 1:
            cur = 0
        else:
            cur = cur + 1
    return lists

def make1d(nested):  # flattens a list of lists into a single list
    newList = []
    for x in nested:
        newList += x
    return newList

def printDict(dit):  # prints the dictionary nicely
    for key in sorted(dit.keys()):
        print key, ":", dit[key]



if __name__ == "__main__":
    print "Starting now"
    start = int(round(time.time() * 1000))
    lock = threading.Lock()
    ditList = []
    threadList = []
    args = sys.argv
    numThreads = args[1]
    fileName = args[2]
    for i in range(int(numThreads)):
        ditList.append({})
    wordLists = getWordsFromFile(numThreads, fileName)
    zipped = zip(ditList, wordLists)
    print "got words from file"
    for pair in zipped:
        threadList.append(CountWords(lock, pair))
    for t in threadList:
        t.start()
    for t in threadList:
        t.join()  # join() is safe even if the thread has already finished
    fin = int(round(time.time() * 1000)) - start
    print "with", numThreads, "threads", "counting the words took :", fin, "ms"
    #printDict(dit)
  • Are you using threads to learn? Because I don't think this code would benefit from threads, since in Python we can't run code in parallel. By adding threads you're only adding overhead; that's why it's slower. Threads in Python are mostly used when you don't want an expensive calculation or I/O to block your code. – forayer Feb 21 '18 at 05:42
  • You are in for some disappointment... https://stackoverflow.com/questions/1294382/what-is-a-global-interpreter-lock-gil – Jeremy Friesner Feb 21 '18 at 05:42

2 Answers


You can use itertools to count the words in the file. Below is a simple example; explore itertools.groupby and adapt the code to your logic.

import itertools

tweets = ["I am a cat", "cat", "Who is a good cat"]

# flatten the tweets into one sorted word list (groupby needs sorted input)
words = sorted(itertools.chain.from_iterable(x.split() for x in tweets))
count = {k: len(list(v)) for k, v in itertools.groupby(words)}
# count == {'I': 1, 'Who': 1, 'a': 2, 'am': 1, 'cat': 3, 'good': 1, 'is': 1}
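If you don't specifically need itertools, the standard library's collections.Counter does the same grouping in a single call, with no sorting step. A minimal sketch using the same example data as above:

```python
from collections import Counter

tweets = ["I am a cat", "cat", "Who is a good cat"]

# Counter consumes the flattened word stream directly; no sorting required
count = Counter(word for tweet in tweets for word in tweet.split())
# count["cat"] == 3, count["a"] == 2
```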

Python cannot run threads in parallel (leveraging multiple cores) due to the GIL (What is a global interpreter lock (GIL)?).

Adding threads to this task only increases the overhead of your code, making it slower.
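A quick way to see this is to time the same pure-Python work run serially and then split across two threads (a minimal sketch; the workload size N is an arbitrary choice):

```python
import threading
import time

def cpu_work(n):
    # pure-Python CPU-bound loop; the GIL is held while bytecode executes
    total = 0
    for i in range(n):
        total += i
    return total

N = 2000000  # hypothetical workload size

# run the work twice serially
start = time.time()
cpu_work(N)
cpu_work(N)
serial = time.time() - start

# run the same two pieces of work on two threads
start = time.time()
threads = [threading.Thread(target=cpu_work, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start
# threaded is typically no faster than serial: only one thread at a time
# can execute Python bytecode, and thread switching adds overhead
```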

I can say two situations you can use threads:

  • When you have a lot of I/O: threads can make your code run concurrently (not in parallel, https://blog.golang.org/concurrency-is-not-parallelism), so it can get a lot done while waiting for responses, which gives a good speed-up.
  • When you don't want a huge computation to block your code: you use a thread to run that computation concurrently with other tasks.
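The I/O case is easy to demonstrate by overlapping simulated blocking calls (a minimal sketch; time.sleep stands in for a real network or disk wait, and the timings are illustrative):

```python
import threading
import time

def fake_io(results, i):
    # simulate a blocking I/O call (a stand-in for a network request)
    time.sleep(0.2)
    results[i] = i * 10

start = time.time()
results = [None] * 4
threads = [threading.Thread(target=fake_io, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# the four 0.2 s waits overlap, so this takes ~0.2 s instead of ~0.8 s:
# the GIL is released while a thread sleeps or waits on I/O
```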

If you want to leverage all your cores you need to use the multiprocessing module (https://docs.python.org/3.6/library/multiprocessing.html).
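For this word-counting task, that means giving each process its own chunk of lines and merging the partial counts at the end. A minimal sketch, not the poster's exact code; the round-robin chunking mirrors the question, while the input lines and worker count are arbitrary choices:

```python
from collections import Counter
from multiprocessing import Pool

def count_words(lines):
    # each worker process counts the words in its own chunk of lines
    return Counter(word for line in lines for word in line.split())

if __name__ == "__main__":
    lines = ["the cat sat", "the cat", "a dog"]  # hypothetical input
    # deal the lines round-robin into one chunk per worker, as in the question
    chunks = [lines[0::2], lines[1::2]]
    with Pool(2) as pool:
        # each chunk is counted in a separate process, so separate cores
        partial = pool.map(count_words, chunks)
    # Counters support +, so the partial counts merge cleanly
    total = sum(partial, Counter())
```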

forayer