
This is for a homework assignment I am working on. I have a working version of the code, but it currently takes ~1 hour to run on the files we've been given. I'll share an example of the files as well as my code (with a high-level description), and I could use some thoughts on why my code runs as slowly as it does. The first file below is the words file, for which I am approximating the number of times each word (represented as a number) has appeared:

the_words.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
13
16
17
6
18
19
20
21
22
23
24
25
6
26
27
28
29
30
9
31
32
33
34
15
35
36
37
9
38
39
11
40
13
41
42

The second file contains the parameters (a, b) for the 5 hash functions used in my script:

the_hashes.txt
3   1561
17  277
38  394
61  13
78  246

and here's a version of my code. At a high level, I (1) do my imports and set variables, (2) define a hash function, (3) loop over the words in the_words.txt (each word is actually an int, confusing I know), hash each word with each of the 5 hash functions, and increment by 1 the value at the corresponding index of the C matrix. My code:

# imports
import numpy as np
import matplotlib.pyplot as plt
import math


# variables used throughout the program
dlt = math.exp(-5)
eps = math.exp(1) * math.pow(10, -4)
my_p = 123457

the_hashes = map(str.split, open('the_hashes.txt', 'r'))
the_hashes = [[int(float(j)) for j in i] for i in the_hashes]
end = len(the_hashes)

rows = int(math.ceil(math.log(1/dlt)))  # int() so np.zeros gets an integral shape
cols = int(math.ceil(math.exp(1)/eps))
C = np.zeros((rows,cols))


# Returns hash(x) for the hash function
# given by parameters a, b, p and n_buckets
def hash_fun(a, b, p, n_buckets, x):
    y = x % p
    hash_val = (a*y + b) % p
    output = hash_val % n_buckets
    return output


# read the file line by line, implementing the algorithm
counter = 0
with open("the_words.txt", "r") as file:
    for word in file:
        counter = counter + 1
        my_x = int(word)

        # loop over the 5 different pairs of (a,b) values for the hashes
        for i in range(0,end):
            my_a = the_hashes[i][0]
            my_b = the_hashes[i][1]

            my_output = hash_fun(my_a, my_b, my_p, cols, my_x)
            C[i,my_output] += 1

        if counter % 10000 == 0:
            print counter

However, for a file with 200M words, this currently takes too long. Is there anything obvious that is causing my code to run slowly? I know it can take a while to stream over 200M words, but I would like to cut this down from the hour it is currently taking.

Thanks!

- Divide the code into a few blocks, then do a time-measuring test (like [this](http://stackoverflow.com/questions/7370801/measure-time-elapsed-in-python)) on each block with a reasonable sample of words. Then you can figure out which block is taking most of the time. – Sangbok Lee Mar 08 '17 at 06:28
- Have you tried running a profiler on your code to get a sense of where most of the time is spent? Given how many times you're calling `hash_fun`, it could be worth implementing it in C, but I'll assume that's not the aim of your exercise here. Some ideas: load the entire file into memory to avoid disk access in the loop, inline `hash_fun` to save a function call, run your code in PyPy... – Laurent S Mar 08 '17 at 06:30
- One of the instructions is to not load the the_words.txt file into memory. What takes the most time is that there are 200M lines of input, but yes, looking at one iteration / line of input and seeing what is slowest is a smart start. – Canovice Mar 08 '17 at 06:43

1 Answer


If you can't load the data into memory, there are some parts you can inline and factor out:

my_range = range(0, end)  # Python 2 only; see note below
counter = 0
with open("the_words.txt", "r") as file:
    for word in file:
        counter = counter + 1
        # factor this out: compute the reduction mod p once per word
        # instead of once per hash function
        y = int(word) % my_p

        # loop over the 5 different pairs of (a,b) values for the hashes
        for i in my_range:
            my_a = the_hashes[i][0]
            my_b = the_hashes[i][1]

            # save a function call by inlining
            # my_output = hash_fun(my_a, my_b, my_p, cols, my_x)
            hash_val = (my_a*y + my_b) % my_p
            my_output = hash_val % cols
            C[i, my_output] += 1

        if counter % 10000 == 0:
            print counter

I would also look at the math in `hash_val = ...` to see if you can factor out some calculations.
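In the same spirit, here is a small sketch of one more thing to hoist (variable names follow the question's code; this would replace the inner loop in the snippet above): the `the_hashes[i][0]` / `the_hashes[i][1]` lookups happen five times per word, 200M times over, and can be bound to tuples once before the file is even opened.

# Sketch: bind the (a, b) pairs once, outside the hot loop, so the inner
# loop does no list indexing at all. Assumes the_hashes, my_p, cols, C
# and the per-word value y from the snippet above.
pairs = [(h[0], h[1]) for h in the_hashes]

# inner loop, per word:
for i, (my_a, my_b) in enumerate(pairs):
    C[i, ((my_a * y + my_b) % my_p) % cols] += 1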

For `range(0, end)`, depending on which version of Python you're using, you might want to cache the call; see https://stackoverflow.com/a/40294986/1138710. (I suspect Python 2 from your `print` statement.)

Also, I suggest reading Python performance characteristics for some interesting ways to improve performance, or at least understand better what you're doing.

The above are just guesses. Check out How can you profile a script? for how to profile your code and know for sure where the bottleneck is.
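As a minimal sketch of that (assuming you wrap the counting loop in a function, here hypothetically called `process_words`), the standard-library profiler is enough:

import cProfile

def process_words():
    # hypothetical wrapper: move the file-reading / hashing loop from
    # the question into here so the profiler can attribute time to it
    pass

# prints per-function call counts and times, sorted by cumulative time
cProfile.run('process_words()', sort='cumulative')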

My other guess, since you're using numpy, would be to rely on its matrix calculation functions, which I think are going to be better optimized. `(a*y + b) % p` looks like nice vector math to me :)
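To make that concrete, here is a minimal sketch of the vectorized idea (untested; the 100000-line chunk size is arbitrary, and `the_hashes`, `my_p`, `cols`, `C` are the question's variables). It still streams the file in chunks rather than loading it all, and computes all five hashes for a whole chunk with one broadcasted expression:

import numpy as np
from itertools import islice

# one column vector of a's and one of b's, shape (5, 1) each
a = np.array([h[0] for h in the_hashes]).reshape(-1, 1)
b = np.array([h[1] for h in the_hashes]).reshape(-1, 1)

with open("the_words.txt", "r") as f:
    while True:
        chunk = list(islice(f, 100000))  # next 100k lines, never the whole file
        if not chunk:
            break
        x = np.array([int(w) for w in chunk], dtype=np.int64)  # shape (n,)
        # broadcast: (5, 1) * (n,) -> (5, n) bucket indices, all hashes at once
        buckets = ((a * (x % my_p) + b) % my_p) % cols
        for i in range(len(the_hashes)):
            # np.add.at accumulates repeated bucket indices correctly, where
            # C[i, buckets[i]] += 1 would count duplicates only once
            np.add.at(C[i], buckets[i], 1)

This trades roughly a billion tiny Python-level operations for a few thousand chunk-sized numpy calls, which is usually where the bulk of the speedup comes from.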

– Laurent S
- This is absolutely great, all really intuitive things to tweak. I'll take a look at the sources you've posted. – Canovice Mar 09 '17 at 05:35