This is for a homework assignment I am working on. I have a working version of the code, but it currently takes ~1 hour to run on the files we've been given. I'll share an example of the files and my code (with a high-level description), and then I could use thoughts on why my code runs as slowly as it does. The first file below is the words file, from which I am approximating the number of times each word (represented as a number) has appeared:
the_words.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
13
16
17
6
18
19
20
21
22
23
24
25
6
26
27
28
29
30
9
31
32
33
34
15
35
36
37
9
38
39
11
40
13
41
42
The second file includes the parameters for 5 hash functions used in my script:
the_hashes.txt
3 1561
17 277
38 394
61 13
78 246
and here's a version of my code. At a high level, I (1) do my imports and set variables, (2) create a hash function, and (3) loop over the words in the_words.txt (each "word" is actually an int, confusing I know), hash each word using the 5 hash functions, and increment by 1 the value at the appropriate index of the C matrix. My code:
# imports
import numpy as np
import math

# variables used throughout the program
dlt = math.exp(-5)
eps = math.exp(1) * math.pow(10, -4)
my_p = 123457
with open('the_hashes.txt', 'r') as f:
    the_hashes = [[int(j) for j in line.split()] for line in f]
end = len(the_hashes)
rows = math.ceil(math.log(1/dlt))
cols = math.ceil(math.exp(1)/eps)
C = np.zeros((rows, cols))

# Returns hash(x) for the hash function
# given by parameters a, b, p and n_buckets
def hash_fun(a, b, p, n_buckets, x):
    y = x % p
    hash_val = (a*y + b) % p
    return hash_val % n_buckets

# read the file line by line, implementing the algorithm
counter = 0
with open("the_words.txt", "r") as file:
    for word in file:
        counter += 1
        my_x = int(word)
        # loop over the 5 different pairs of (a,b) values for the hashes
        for i in range(end):
            my_a = the_hashes[i][0]
            my_b = the_hashes[i][1]
            my_output = hash_fun(my_a, my_b, my_p, cols, my_x)
            C[i, my_output] += 1
        if counter % 10000 == 0:
            print(counter)
However, for a file with 200M words, this currently takes too long. Is there anything obvious that is causing my code to run slowly? I know it can take a while to stream over 200M words, but I would like to cut this down from the hour it is currently taking.
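For reference, here is a vectorized variant I have been sketching, on the assumption that the whole words file fits in memory (I load it with np.loadtxt in the real script; below I substitute a small inline array so the snippet is self-contained). The idea is to replace the per-word Python loop and per-element C increments with one broadcasted hash computation and np.bincount per hash row. I am not certain this is the intended approach for the assignment, just an experiment:

```python
import numpy as np
import math

my_p = 123457
eps = math.exp(1) * math.pow(10, -4)
cols = math.ceil(math.exp(1) / eps)  # same n_buckets as above

# (a, b) pairs from the_hashes.txt
ab = np.array([[3, 1561], [17, 277], [38, 394], [61, 13], [78, 246]])

# in the real script this would be: words = np.loadtxt("the_words.txt", dtype=np.int64)
words = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.int64)

# compute all 5 hashes for all words at once via broadcasting:
# ab[:, 0:1] has shape (5, 1), words has shape (n,), so idx has shape (5, n)
y = words % my_p
idx = ((ab[:, 0:1] * y + ab[:, 1:2]) % my_p) % cols

# count bucket hits per hash row instead of incrementing one cell at a time
C = np.zeros((len(ab), cols))
for i in range(len(ab)):
    C[i] = np.bincount(idx[i], minlength=cols)
```

For 200M words this may need chunked reads rather than one np.loadtxt call, but the same broadcast-then-bincount step works per chunk, accumulating into C.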
Thanks!