My code takes about two hours to run. The bottleneck is the nested for loop and if statements (see the comment in the code). I'm a beginner with Python :) Can anyone recommend an efficient Python way to replace them?
I have tables of ~30 million rows, each row with (x,y,z) values:
20.0 11.3 7
21.0 11.3 0
22.0 11.3 3
...
My desired output is a table of the form x, y, min(z), count(min(z)), where the last column counts how many rows share the minimum z at that (x, y). E.g.:
20.0 11.3 7 7
21.0 11.3 0 10
22.0 11.3 3 1
...
There are only about 600 unique (x, y) coordinates, so the output table will be 600×4. My code:
import numpy as np

file = open('input.txt', 'r')
coordset = set()
data = np.full((600, 4), np.nan)
irow = 0
for row in file:
    item = row.split()
    x = float(item[0])
    y = float(item[1])
    z = float(item[2])

    # build unique grid of coords
    if (x, y) not in coordset:
        data[irow][0] = x
        data[irow][1] = y
        data[irow][2] = z
        data[irow][3] = 0          # count gets filled in by the loop below
        irow = irow + 1            # grows up to 599
        # lookup table of unique coords
        coordset.add((x, y))

    # BOTTLENECK. replace ifs? for?
    for i in range(irow):
        if data[i][0] == x and data[i][1] == y:
            if z > data[i][2]:
                continue
            elif z == data[i][2]:
                data[i][3] += 1    # another row tied with the current min
            else:                  # z < data[i][2]: found a new min
                data[i][2] = z
                data[i][3] = 1
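For scale: the inner loop rescans up to ~600 stored coordinates for every one of the ~30 million input rows, so it does on the order of 10^10 comparisons. One alternative I'm considering is a single pass with a dict keyed on (x, y), which avoids the rescan entirely. A rough sketch (untested on the full file; minmap is just an illustrative name):

# rough sketch: one pass, constant-time lookup per row
minmap = {}  # (x, y) -> [min_z, count of rows tied at that min]
with open('input.txt') as f:
    for line in f:
        x, y, z = map(float, line.split())
        entry = minmap.get((x, y))
        if entry is None or z < entry[0]:
            minmap[(x, y)] = [z, 1]   # first sighting of this coord, or a new min
        elif z == entry[0]:
            entry[1] += 1             # another row tied with the current min

# rows in the desired x, y, min(z), count(min(z)) form
result = [(x, y, zmin, n) for (x, y), (zmin, n) in minmap.items()]

Is something like this, or a NumPy equivalent, the right direction?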
Edit: for reference, the approach by @Joowani computes in 1m26s; my original approach, on the same computer and datafile, takes 106m23s.

Edit 2: @Ophion and @Sibster, thanks for the suggestions; I don't have enough credit to +1 useful answers.
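For later readers: I don't know exactly what @Ophion and @Sibster proposed, but a vectorized NumPy version of this aggregation can be written with np.unique and np.minimum.at, along these lines (a sketch, assuming the whole table fits in memory; not their exact code):

import numpy as np

data = np.loadtxt('input.txt')             # columns: x, y, z
xy, z = data[:, :2], data[:, 2]

# label every row with the index of its unique (x, y) pair
uniq, inv = np.unique(xy, axis=0, return_inverse=True)
inv = inv.ravel()                          # keep the labels 1-D

# per-group minimum of z via an unbuffered in-place update
zmin = np.full(len(uniq), np.inf)
np.minimum.at(zmin, inv, z)

# count the rows where z ties its group's minimum
counts = np.bincount(inv, weights=(z == zmin[inv]), minlength=len(uniq))

result = np.column_stack([uniq, zmin, counts])   # ~600 x 4

Note that np.minimum.at is known to be slow compared to sort-based groupbys, but it keeps the sketch short.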