My code takes about two hours to process the data. The bottleneck is in the for loop and if statements (see the comment in the code). I'm a beginner with Python :) Can anyone recommend an efficient Python way to replace the nested for and if statements?

I have tables of ~30 million rows, each row with (x,y,z) values:

20.0 11.3 7
21.0 11.3 0
22.0 11.3 3
...

My desired output is a table in the form x, y, min(z), count(min(z)). The last column is the count of rows with the minimum z value at that (x,y). E.g.:

20.0 11.3 7 7
21.0 11.3 0 10
22.0 11.3 3 1
...

There are only about 600 unique coordinates, so the output table will be 600x4. My code:

import numpy as np
file = open('input.txt', 'r')

coordset = set()
data = np.zeros((600,4))*np.nan
irow = 0 
ctr = 0 

for row in file:
    item = row.split()
    x = float(item[0])
    y = float(item[1])
    z = float(item[2])

    # build unique grid of coords
    if (x, y) not in coordset:
        data[irow][0] = x 
        data[irow][1] = y 
        data[irow][2] = z 
        irow = irow + 1     # grows up to 599 

    # lookup table of unique coords
    coordset.add((x,y))

    # BOTTLENECK. replace ifs? for?
    for i in range(0, irow):
        if data[i][0]==x and data[i][1]==y:
            if z > data[i][2]:
                continue
            elif z==data[i][2]:
                ctr = ctr + 1
                data[i][3]=ctr
            if z < data[i][2]:
                data[i][2] = z
                ctr = 1
                data[i][3]=ctr

Edit: For reference, the approach by @Joowani computes in 1m26s. My original approach, on the same computer with the same data file, took 106m23s. Edit 2: @Ophion and @Sibster, thanks for the suggestions; I don't have enough credit to +1 useful answers.

  • Is 30 million rows really something you want to save in txt? Shouldn't you look at some more sophisticated format to save (and read in) your data? Also, I suggest vectorization (numpy) whenever possible, since that pushes the for-loops into numpy, which is C (hence faster) – usethedeathstar Aug 19 '13 at 06:51
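
A minimal sketch of the binary-format idea from that comment, assuming the whole table fits in memory; the file names are hypothetical:

import numpy as np

# One-time conversion: parse the text file once, then keep a binary .npy copy.
# 'input.txt' and 'input.npy' are hypothetical names.
arr = np.loadtxt('input.txt')      # shape (~30e6, 3): columns x, y, z
np.save('input.npy', arr)

# Subsequent runs: loading the binary file skips the slow text parsing.
arr = np.load('input.npy')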

3 Answers


Your solution seems slow because it iterates through the list (i.e. data) every time you make an update. A better approach would be using a dictionary, which takes O(1) as opposed to O(n) per update.

Here would be my solution using a dictionary:

file = open('input.txt', 'r')

# dictionary mapping (x, y) -> [minimum z seen so far, count of that minimum]
c = {}

for line in file:
    (x, y, z) = (float(n) for n in line.split())

    if (x, y) not in c:
        c[(x, y)] = [z, 1]                    # first time we see this coordinate
    elif c[(x, y)][0] > z:
        c[(x, y)][0], c[(x, y)][1] = z, 1     # new minimum: reset the count
    elif c[(x, y)][0] == z:
        c[(x, y)][1] += 1                     # same minimum again: bump the count

for key in c:
    print("{} {} {} {}".format(key[0], key[1], c[key][0], c[key][1]))
Joohwan
    To add to this, I found a reference (http://wiki.python.org/moin/TimeComplexity) that explains the efficiency of dictionary operations. – 3150 Aug 19 '13 at 20:36

Why not change the last if to an elif?

As it is now, you will evaluate z < data[i][2] on every iteration of the loop.

You could even replace it with a plain else, since you have already checked z > data[i][2] and z == data[i][2], so the only remaining possibility is z < data[i][2].

So the following code will do the same and should be faster:

        if z > data[i][2]:
            continue
        elif z==data[i][2]:
            ctr = ctr + 1
            data[i][3]=ctr
        else:
            data[i][2] = z
            ctr = 1
            data[i][3]=ctr
Sibster

To do this in numpy, use np.unique.

import numpy as np

def count_unique(arr):
    # View each row as a single void element so np.unique can compare whole rows
    row_view = np.ascontiguousarray(arr).view(
        np.dtype((np.void, arr.dtype.itemsize * arr.shape[1])))
    ua, uind = np.unique(row_view, return_inverse=True)
    unique_rows = ua.view(arr.dtype).reshape(ua.shape + (-1,))
    count = np.bincount(uind)
    return np.hstack((unique_rows, count[:, None]))

First let's check it on a small array:

a=np.random.rand(10,3)
a=np.around(a,0)

print a
[[ 0.  0.  0.]
 [ 0.  1.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  1.]
 [ 1.  1.  0.]
 [ 1.  0.  1.]
 [ 1.  0.  1.]
 [ 1.  0.  0.]
 [ 0.  0.  0.]]

output = count_unique(a)
print output
[[ 0.  0.  0.  2.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  1.  2.]
 [ 1.  0.  0.  2.]
 [ 1.  0.  1.  2.]
 [ 1.  1.  0.  1.]]

 print np.sum(output[:,-1])
 10

Looks good! Now let's check a large array:

a = np.random.rand(int(3E7), 3)
a=np.around(a,1)

output=count_unique(a)
print output.shape
(1331, 4)  #Close as I can get to 600 unique elements.

print np.sum(output[:,-1])
30000000.0

It takes about 33 seconds on my machine and 3GB of memory; doing this all in memory for large arrays will likely be your bottleneck. For reference, @Joowani's solution took about 130 seconds, although this is a bit of an apples-and-oranges comparison since we start from a numpy array. Your mileage may vary.

To read in the data as a numpy array, I would look at the question here, but it should look something like the following:

arr=np.genfromtxt("./input.txt", delimiter=" ")

For loading that much data from a txt file, I would really recommend using the pandas example in that link.
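
That pandas example is not reproduced here, but a minimal sketch of that kind of loader, assuming whitespace-separated columns (the column names are made up), might look like:

import pandas as pd

# Hypothetical sketch: pandas parses large text files much faster than np.genfromtxt.
# Read the whitespace-separated x, y, z columns and hand them to numpy.
df = pd.read_csv("./input.txt", sep=r"\s+", header=None, names=["x", "y", "z"])
arr = df.to_numpy()          # (N, 3) float array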

Daniel