
I have an extensive dataset stored as an array a = [X, Y, Z, value]. I also have another array b = [X, Y] containing all the unique (X, Y) coordinate combinations from the same dataset.

I would like to generate a new array that, for a given z = 100, contains the records of the original array a[X, Y, Z, value] where Z is closest to z = 100, for each possible (X, Y) combination.

The purpose of this is to extract a Z slice of the original dataset at a given depth.

A description of the desired outcome would go like this:

np.in1d(a[:,0], b[:,0]) and np.in1d(a[:,1], b[:,1]) # for each row
# where both these arguments are True

a[:,2] == z + min(abs(a[:,2] - z)) # find the rows where Z is closest to z=100
# and append these rows to a new array c[X,Y,Z,value]

The idea is to first find the unique (X, Y) pairs, effectively slicing the dataset into (X, Y) columns of the domain, and then search each of these columns to extract the row whose Z is closest to the given z value.

Any suggestion, even for a much different approach, would be highly appreciated.
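For concreteness, the idea described above can be sketched as a plain loop over the unique (X, Y) pairs; the toy data here is made up for illustration:

```python
import numpy as np

# illustrative data: columns are X, Y, Z, value
a = np.array([
    [0., 0.,  90., 0.1],
    [0., 0., 105., 0.2],
    [0., 1., 120., 0.3],
    [0., 1.,  99., 0.4],
])
b = np.unique(a[:, :2], axis=0)  # unique (X, Y) pairs
z = 100

rows = []
for x, y in b:
    col = a[(a[:, 0] == x) & (a[:, 1] == y)]          # one (X, Y) column
    rows.append(col[np.abs(col[:, 2] - z).argmin()])  # row with Z closest to z
c = np.array(rows)
```

This is O(len(b) * len(a)) and therefore slow on a large dataset, but it states the goal unambiguously.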

Red Sparrow

1 Answer

import numpy as np

a = np.random.rand(10000, 4) * [[20, 20, 200, 1]]  # data in a 20*20*200 space
a[:, :2] //= 1                    # integer-valued coords for X, Y
bj = a.T[0] + 1j * a.T[1]         # trick: encode (x, y) as a complex scalar to sort on 2 cols
b = np.unique(bj)                 # the distinct (x, y) pairs
ib = bj.argsort()                 # indices for sorting by (x, y)
splits = bj[ib].searchsorted(b)   # indices for splitting
xy = np.split(a[ib], splits)      # subsets of data grouped by (x, y); xy[0] is empty
c = np.array([s[np.abs(s.T[2] - 100).argmin()] for s in xy[1:]])  # locate the good point in each group
print(c[:10])

gives:

[[   0.            0.          110.44068611    0.71688432]
 [   0.            1.          103.64897184    0.31287547]
 [   0.            2.          100.85948189    0.74353677]
 [   0.            3.          105.28286975    0.98118126]
 [   0.            4.           99.1188121     0.85775638]
 [   0.            5.          107.53733825    0.61015178]
 [   0.            6.          100.82311896    0.25322798]
 [   0.            7.          104.16430907    0.26522796]
 [   0.            8.          100.47370563    0.2433701 ]
 [   0.            9.          102.40445547    0.89028359]]

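To see what the complex-number grouping is doing, here is the same pipeline on a tiny hand-made array (data invented for illustration, not from the answer's run):

```python
import numpy as np

a = np.array([[0., 0., 5., .1],
              [1., 0., 2., .2],
              [0., 0., 9., .3],
              [1., 0., 7., .4]])
bj = a.T[0] + 1j * a.T[1]        # encode (x, y) as x + y*1j, a sortable scalar
b = np.unique(bj)                # the distinct (x, y) pairs: [0+0j, 1+0j]
ib = bj.argsort()                # reorder rows so equal (x, y) are adjacent
splits = bj[ib].searchsorted(b)  # start index of each group: [0, 2]
xy = np.split(a[ib], splits)     # xy[0] is the empty prefix, hence the [1:] above
```

NumPy sorts complex numbers by real part, then imaginary part, which is exactly a two-column sort on (x, y).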
At a higher level, with pandas:

import pandas as pd

labels = list('xyzt')
df = pd.DataFrame(a, columns=labels)
df['dist'] = (df.z - 100).abs()
indices = df.groupby(['x', 'y'])['dist'].idxmin()   # row index of the nearest z per group
c = df.loc[indices, labels].reset_index(drop=True)  # .loc; df.ix is deprecated
print(c.head())

giving:

   x  y           z         t
0  0  0  110.440686  0.716884
1  0  1  103.648972  0.312875
2  0  2  100.859482  0.743537
3  0  3  105.282870  0.981181
4  0  4   99.118812  0.857756

It is clearer, but 8x slower.
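As a side note, the same slice can be taken in pandas without a groupby reduction, by sorting on the distance and keeping the first row per (x, y) group. This is an alternative sketch, not the answer's method:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seeded only so the example is reproducible
a = np.random.rand(10000, 4) * [[20, 20, 200, 1]]
a[:, :2] //= 1
df = pd.DataFrame(a, columns=list('xyzt'))

# sort all rows by |z - 100|, then keep the first (nearest) row of each (x, y) group
c = (df.assign(dist=(df.z - 100).abs())
       .sort_values('dist')
       .drop_duplicates(['x', 'y'])
       .drop(columns='dist')
       .sort_values(['x', 'y'])
       .reset_index(drop=True))
```

`drop_duplicates` keeps the first occurrence, which after the sort is the row with the smallest distance in its group.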

B. M.
  • Works like a charm in both cases. Indeed, the pandas version is a bit easier to follow, but the first solution is faster. Since I am not so fluent with all this (yup, you guessed right), I have been trying to understand exactly how it works. I am not sure I understand this part: `a[:,:2] //= 1` (what does it actually do?). Also, why do we need to transpose? The trick for sorting on two columns appears to have the same effect as `bj=np.lexsort((a.T[:,0],a.T[:,1]))`. Is there a difference between the two? – Red Sparrow Jan 28 '16 at 10:12
  • `a[:,:2] //= 1` is there so that each (x, y) pair has many z values. Your sort method is quite good and less tricky. – B. M. Jan 28 '16 at 21:33
  • It's still not clear to me what `a[:,:2] //= 1` does. I have also looked at [this](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation), but they don't explain the comma in the middle. – Red Sparrow Jan 29 '16 at 16:58
  • `a` is a 2D array, so the first `:` selects all the rows (10000) and `:2` means `0:2`, i.e. the first two columns, x and y. `rand` generates only floats, so `//= 1` turns x and y into floats without fractional parts, to get points *regularly spaced in X, Y*. – B. M. Jan 29 '16 at 19:56
  • Thanks for the clarification, now it's clear to me how it works. – Red Sparrow Feb 01 '16 at 15:57
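For completeness on the lexsort idea raised in the comments: `np.lexsort` treats its *last* key as the primary one, so sorting by x then y uses the keys `(a[:, 1], a[:, 0])`, taken column-wise (no transpose needed). A sketch of how it would replace the complex-number trick, with illustrative data:

```python
import numpy as np

np.random.seed(0)  # seeded only for reproducibility
a = np.random.rand(1000, 4) * [[20, 20, 200, 1]]
a[:, :2] //= 1

ib = np.lexsort((a[:, 1], a[:, 0]))  # last key is primary: sort by x, then y
s = a[ib]
# group boundaries: the rows where the (x, y) pair changes
starts = np.flatnonzero(np.r_[True, np.any(np.diff(s[:, :2], axis=0) != 0, axis=1)])
groups = np.split(s, starts[1:])
c = np.array([g[np.abs(g[:, 2] - 100).argmin()] for g in groups])
```

Unlike the complex encoding, this extends unchanged to three or more key columns.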