
I have many points in the x,y plane, around 10,000 of them, and each point (x,y) has an intrinsic radius r. This small data set is only one tiny corner of my entire data set. Given a point of interest (x1,y1), I want to find the nearby points around (x1,y1) within a distance of 1 that also meet the criterion that the distance between (x,y) and (x1,y1) is less than r. I want to return the indices of those good points, not the good points themselves.

import numpy as np
np.random.seed(2000)
x = 20.*np.random.rand(10000)
y = 20.*np.random.rand(10000)
r = 0.3*np.random.rand(10000)
x1 = 10.  ### (x1,y1) is an interest point 
y1 = 12.
def index_finder(x, y, r, x1, y1):
    idx = (abs(x - x1) < 1.) & (abs(y - y1) < 1.)  ### this cut will probably drop ~90% of the data
    x_temp = x[idx]   ### but by doing this I lose track of the original indices
    y_temp = y[idx]
    r_temp = r[idx]   ### r must be filtered the same way so the shapes match
    dis_square = (x_temp - x1)*(x_temp - x1) + (y_temp - y1)*(y_temp - y1)
    idx1 = dis_square < r_temp*r_temp    ### after this cut, only a few points are left
    x_good = x_temp[idx1]
    y_good = y_temp[idx1]
    return x_good, y_good

With this function I can find the good points around (x1,y1), but not their indices. HOWEVER, I need the ORIGINAL indices, because the ORIGINAL indices are used to extract other data associated with the coordinates (x,y). As I mentioned, the sample data set is only a tiny corner of my entire data set; I will call the above function around 1,000,000 times on the full data set, so the efficiency of the index_finder function is also a consideration.

Any thoughts on such task?

Huanian Zhang
  • How are you using `index_finder` for all those points? Are you using it in a loop or just like that? – Divakar Sep 18 '17 at 06:37
  • I will use this function inside a loop, because I have many such points of interest like `(x1,y1)`. The function itself avoids any loop. And this data set is only 1/1000 of my whole data set. – Huanian Zhang Sep 18 '17 at 06:41

3 Answers


Approach #1

We could simply index into the first mask with itself and assign the second-stage mask into its True places, like so -

idx[idx] = idx1

Thus, idx would end up holding the final mask of good places corresponding to the original arrays x and y, i.e. -

x_good = x[idx]
y_good = y[idx]

This mask could then be used to index into other arrays as mentioned in the question.
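Put together, Approach #1 as a complete function might look like the sketch below. The function name is mine, and the `r[idx]` filtering is an assumption I've added so the per-point radii line up with the box-cut subset (the question's `r*r` would otherwise be the full-length array):

```python
import numpy as np

def index_finder_mask(x, y, r, x1, y1):
    # Sketch of Approach #1; name and r[idx] filtering are mine.
    idx = (np.abs(x - x1) < 1.) & (np.abs(y - y1) < 1.)   # coarse box cut
    x_temp = x[idx]
    y_temp = y[idx]
    dis_square = (x_temp - x1)**2 + (y_temp - y1)**2
    idx1 = dis_square < r[idx]**2          # refine on the surviving subset only
    idx[idx] = idx1                        # write the refinement back into the coarse mask
    return np.flatnonzero(idx)             # original indices of the good points
```

The boolean `idx` left behind can also be used directly to index any other array aligned with x and y.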


Approach #2

As another approach, we could use two conditional statements, creating two masks. Finally, combine them by AND-ing to get the combined mask, which can be used to index into the x and y arrays for the final outputs. We won't need the actual indices that way, which is one more benefit.

Hence, the implementation -

X = x-x1
Y = y-y1
mask1 = (np.abs(X) < 1.) & (np.abs(Y) < 1.)
mask2 = X**2 + Y**2 < r**2
comb_mask = mask1 & mask2

x_good = x[comb_mask]
y_good = y[comb_mask]

If for some reason, you still need the corresponding indices, just do -

comb_idx = np.flatnonzero(comb_mask)

If you are doing these operations for different x1 and y1 pairs for the same x and y dataset, I would suggest using broadcasting to vectorize it through all those x1, y1 paired datasets, as shown in this post.
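As a sketch of that broadcasting idea (the helper name is mine), one vectorized pass can handle a whole batch of (x1, y1) pairs. Note the intermediate arrays have shape (n_queries, n_points), so very large batches would need to be processed in chunks:

```python
import numpy as np

def index_finder_batch(x, y, r, X1, Y1):
    # X1, Y1: 1D arrays of query coordinates; x, y, r: the data arrays.
    dX = x - X1[:, None]                  # shape (n_queries, n_points)
    dY = y - Y1[:, None]
    mask = dX**2 + dY**2 < r**2           # r broadcasts along the points axis
    return [np.flatnonzero(row) for row in mask]
```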

Divakar
  • Thanks for your answer. I guess this implementation will be a little bit less efficient. I also want to speed it up because I will have a large loop around 1,000,000 times to call this function. – Huanian Zhang Sep 18 '17 at 06:55
  • @HuanianZhang Little bit less efficient than what? – Divakar Sep 18 '17 at 06:55
  • I guess it will be a little less efficient than my implementation, because mine only calculates about 10% of the data in the second cut. But the drawback of my implementation is that it cannot return the indices. – Huanian Zhang Sep 18 '17 at 06:59
  • In order to speed it up (because I have 1 million loops), then I split the two cuts. I will test your implementation to see how long it will take. – Huanian Zhang Sep 18 '17 at 07:02
  • Thanks for your time and help. I tried your two approaches. Interestingly, the second approach is more efficient. I have the results and time usage for both, like below: `[1575 3709 5706] time used for approach 1 is 0.000797033309937 [1575 3709 5706] time used for approach 2 is 0.000538110733032`. That is totally unexpected. – Huanian Zhang Sep 18 '17 at 07:51
  • @HuanianZhang It's probably the indexing : `x[idx]` that's killing the first one. Also, we could use `power` instead of self-multiplication, updated. – Divakar Sep 18 '17 at 09:12

numpy.where seems made for finding the indices

the vectorized norm calc + np.where() could be faster than a loop

sq_norm = (x - x1)**2 + (y - y1)**2  # no need to take 10000 sqrt
idcs = np.where(sq_norm < 1.)

len(idcs[0])
Out[193]: 69

np.stack((idcs[0], x[idcs], y[idcs]), axis=1)[:5]
Out[194]: 
array([[  38.        ,    9.47165956,   11.94250173],
       [  39.        ,    9.6966941 ,   11.67505453],
       [ 276.        ,   10.68835317,   12.11589316],
       [ 288.        ,    9.93632584,   11.07624915],
       [ 344.        ,    9.48644057,   12.04911857]])

the norm calc can include the r array too, as a 2nd step:

r_sq_norm = (x[idcs] - x1)**2 + (y[idcs] - y1)**2 - r[idcs]**2
r_idcs = np.where(r_sq_norm < 0.)

idcs[0][r_idcs]
Out[11]: array([1575, 3476, 3709], dtype=int64)

you might want to time the 2 step test vs including r in the 1st vectorized norm calc?

sq_norm = (x - x1)**2 + (y - y1)**2 - r**2
idcs = np.where(sq_norm < 0.)

idcs[0]
Out[13]: array([1575, 3476, 3709], dtype=int64)
f5r5e5d

You can take a mask of your indices, like so:

def index_finder(x, y, r, x1, y1):
    idx = np.nonzero((abs(x - x1) < 1.) & (abs(y - y1) < 1.))  # integer indices, not boolean
    mask = (x[idx] - x1)*(x[idx] - x1) + (y[idx] - y1)*(y[idx] - y1) < r[idx]*r[idx]
    idx1 = [i[mask] for i in idx]
    x_good = x[idx1[0]]
    y_good = y[idx1[0]]
    return x_good, y_good, idx1

now idx1 is the indices you want to extract.

A faster way in general to do this is to use scipy.spatial.KDTree

from scipy.spatial import KDTree

xy = np.stack((x, y), axis=-1)  # KDTree expects points of shape (n, 2)
kdt = KDTree(xy)
kdt.query_ball_point([x1, y1], r)

If you have many points to query against the same data set, this will be much faster than sequentially calling your index_finder function.

x1y1 = np.stack((x1, y1)) #`x1` and `y1` are arrays of coordinates.
kdt.query_ball_point(x1y1, r)

ALSO WRONG (the criterion depends on each data point's own r, not on a radius attached to the query): if you have different distances for each point, you can do:

def query_variable_ball(kdtree, x, y, r):
    out = []
    for x_, y_, r_ in zip(x, y, r):
        out.append(kdtree.query_ball_point([x_, y_], r_))
    return out

xy = np.stack((x, y), axis=-1)
kdt = KDTree(xy)
query_variable_ball(kdt, x1, y1, r)

EDIT 2: This should work with different r values for each point

from scipy.spatial import KDTree

def index_finder_kd(x, y, r, x1, y1):  # all arrays
    xy = np.stack((x,y), axis = -1)
    x1y1 = np.stack((x1, y1), axis = -1)
    xytree = KDTree(xy)
    d, i = xytree.query(x1y1, k = None, distance_upper_bound = 1.)
    good_idx = np.zeros(x.size, dtype = bool)
    for idx, dist in zip(i, d):
        good_idx[idx] |= r[idx] > dist
    x_good = x[good_idx]
    y_good = y[good_idx]
    return x_good, y_good, np.flatnonzero(good_idx)

This is very slow for only one (x1, y1) pair as the KDTree takes a while to populate. But if you have millions of pairs, this will be much faster.

(I've assumed you want the union of all good points in the (x, y) data for all (x1, y1), if you want them separately it's also possible using a similar method, removing elements of i[j] based on whether d[j] < r[i[j]])
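One way to sketch that per-query variant (the function name is mine, and it swaps `query` for `query_ball_point` with the question's search distance of 1, then filters candidates by each neighbour's own r):

```python
import numpy as np
from scipy.spatial import KDTree

def per_query_indices(x, y, r, x1, y1):
    # For each (x1[j], y1[j]), return the original indices of data points
    # within distance 1 whose own radius r covers the query distance.
    tree = KDTree(np.stack((x, y), axis=-1))
    out = []
    for xi, yi in zip(x1, y1):
        cand = np.asarray(tree.query_ball_point([xi, yi], 1.), dtype=int)
        d2 = (x[cand] - xi)**2 + (y[cand] - yi)**2
        out.append(cand[d2 < r[cand]**2])
    return out
```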

Daniel F