
I have two 2D point clouds (oldPts and newPts) which I wish to combine. They are m×2 and n×2 NumPy integer arrays, with m and n of order 2000. newPts contains many duplicates or near-duplicates of oldPts and I need to remove these before combining.

So far I have used the histogram2d function to produce a 2D representation of oldPts (H). I then compare each new point to an N×N area of H and accept it if that area is empty. This last part I am currently doing with a Python loop, which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different method of going about the problem? The working code is below.

import numpy as np

npzfile = np.load(path + datasetNo + '\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]

# remove all the negative values 
oldPts = oldPts[oldPts.min(axis=1)>=0,:]
newPts = newPts[newPts.min(axis=1)>=0,:]

# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = np.around(newPts).astype(int)

# put the oldPts into 2d array
H, xedg,yedg= np.histogram2d(oldPts[:,0],oldPts[:,1],
                         bins = [xMax,yMax], 
                         range = [[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0, pt[0]-N):min(xMax, pt[0]+N),
             max(0, pt[1]-N):min(yMax, pt[1]+N)].any():
        finalNewList.append(pt)

finalNew = np.array(finalNewList)  
Ahmed Fasih
Alan Johnstone

1 Answer


The right way to do this is to use linear algebra to compute the distance between each pair of 2-long vectors, and then accept only the new points that are "different enough" from every old point, using scipy.spatial.distance.cdist:

import numpy as np
oldPts = np.random.randn(1000,2)
newPts = np.random.randn(2000,2)

from scipy.spatial.distance import cdist
dist = cdist(oldPts, newPts)
print(dist.shape) # (1000, 2000)

okIndex = np.min(dist, axis=0) > 5
print(np.sum(okIndex)) # number of new points accepted

finalNew = newPts[okIndex,:]
print(finalNew.shape)

Above I use the Euclidean distance of 5 as the threshold for "too close": any point in newPts that's farther than 5 from all points in oldPts is accepted into finalNew. You will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking the best one.
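One quick way to inspect those distances (a sketch, with random stand-in data rather than your arrays) is to look at each new point's distance to its nearest old point; the threshold should sit somewhere inside that distribution:

```python
import numpy as np
from scipy.spatial.distance import cdist

oldPts = np.random.randn(1000, 2)
newPts = np.random.randn(2000, 2)
dist = cdist(oldPts, newPts)

# Distance from each new point to its closest old point; percentiles of
# this distribution suggest where a sensible threshold lies.
nearest = dist.min(axis=0)
print(np.percentile(nearest, [5, 50, 95]))
```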

(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)

This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get essentially the same answer as the histogram by passing metric='chebyshev' to cdist (the Chebyshev distance max(|dx|, |dy|) corresponds to the square N×N window you slice out of H), assuming your histogram bin widths are the same in both dimensions, and using 5 again as the threshold.
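A minimal sketch of that equivalence, on toy integer data rather than your arrays (note the histogram slice is asymmetric by one bin at the upper edge, so the match is approximate right at the boundary):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
oldPts = rng.integers(0, 50, size=(200, 2))
newPts = rng.integers(0, 50, size=(300, 2))
N = 5

# Chebyshev distance max(|dx|, |dy|) < N means "inside an axis-aligned
# square window around the point", which is what the histogram slice tests.
dist = cdist(oldPts, newPts, metric='chebyshev')
accepted = dist.min(axis=0) >= N
print(accepted.sum(), 'of', len(newPts), 'new points accepted')
```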

(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)

  • Thank you for taking the time to answer. It may be the right and obvious way to do it, but it is 3 times slower than the way I have shown, and I hope to speed that up if I can work out the broadcasting. The timings in a Jupyter notebook were 50.1 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) and 152 ms ± 973 µs per loop (mean ± std. dev. of 7 runs, 10 loops each). The use of kd-trees may produce a better time but I am not an expert. – Alan Johnstone Jun 11 '19 at 15:14
  • Yup, kd-trees are great: see https://stackoverflow.com/a/49912284/500207 for the opposite problem—the question is about finding the same points in the two arrays, but you can readily adapt it to finding different points. Yes, `cdist` is doing a lot more work than you need (since if you find a single point in `oldPts` that's too close to a point in `newPts`, you can eliminate it right away, instead of enumerating *all* other old points). – Ahmed Fasih Jun 11 '19 at 22:33
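A sketch of the kd-tree approach mentioned above, with illustrative data and threshold (not the question's arrays): build a tree over the old points once, then query each new point's nearest old neighbour.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
oldPts = rng.integers(0, 100, size=(500, 2))
newPts = rng.integers(0, 100, size=(800, 2))
threshold = 5.0

# One tree build over oldPts, then a single vectorized query: Euclidean
# distance from each new point to its nearest old point.
tree = cKDTree(oldPts)
nearest, _ = tree.query(newPts, k=1)
finalNew = newPts[nearest > threshold]
print(finalNew.shape)
```

To mimic the question's square-window test instead of a Euclidean ball, `query` accepts `p=np.inf` (Chebyshev distance).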