1

I want to compare elements of two numpy arrays and delete the elements of one of those arrays if the eucledean distance between the coordinates is smaller than 1 and the time is the same. data_CD4 and data_CD8 are the arrays. The elements of the arrays are lists with 3D Coordinates and the time as 4th element (numpy.array([[x,y,z,time],[x,y,z,time].....]). Co is the Cutoff, here 1.

for i in data_CD8:
        for m in data_CD4:
            if distance.euclidean(tuple(i[:3]),tuple(m[:3])) < co and i[3]==m[3] :
                data_CD8=np.delete(data_CD8, i, 0)

Is there a faster approach to do that? The first array has 5000 elements, the second 2000, so it tooks too much time.

Divakar
  • 218,885
  • 19
  • 262
  • 358
Varlor
  • 1,421
  • 3
  • 22
  • 46

3 Answers3

2

This should be a vectorized method.

mask1 = np.sum((data_CD4[:, None, :3] - data_CD8[None, :, :3])**2, axis = -1) < co**2
mask2 = data_CD4[:, None, 3] == data_CD8[None, :, 3]
mask3 = np.any(np.logical_and(mask1, mask2), axis = 0)
data_CD8 = data_CD8[~mask3]

mask1 should speed the distance calculations up as it doesn't require a square root call. mask1 and mask2 are 2-D arrays which we squeeze to 1d by np.any Doing all the deleting at the end prevents the pile of read/writes.

Speed testing:

a = np.random.randint(0, 10, (100, 3))

b = np.random.randint(0, 10, (100, 3))

%timeit cdist(a,b) < 5  #Divakar's answer
10000 loops, best of 3: 133 µs per loop

%timeit np.sum((a[None, :, :] - b[:, None, :]) ** 2, axis = -1) < 25  # My answer
1000 loops, best of 3: 418 µs per loop

And the C-compiled code wins, even when adding an unnecessary square root.

Daniel F
  • 13,620
  • 2
  • 29
  • 55
  • Thanks for the effort. I get this error when trying the code: IndexError: index 3284 is out of bounds for axis 0 with size 2587 – Varlor Apr 04 '17 at 12:36
  • Hard to tell what the error is without the line, but try `axis = 0` in `mask3` – Daniel F Apr 04 '17 at 12:40
  • Aaand as in Divakar's answer you need to invert `mask3` – Daniel F Apr 04 '17 at 12:49
  • Ah nice! Yeah now both of your codes have the same results. Can you maybe explain what you code does? I really want to understand it :) – Varlor Apr 04 '17 at 12:52
  • If the original arrays have dimensions `(n,4)` and `(m,4)` in both cases `mask1` and `mask2` are `(n,m)` boolean arrays comparing every 'row' of one array to every 'row' of the other. Then you use `np.any` to look for any `True` values for any of the `data_CD8` 'rows'. since you want to get rid of those rows, you invert the `np.any` output and index the original data against it to get your output. – Daniel F Apr 04 '17 at 13:00
  • The difference between @Divakar and my answer is he uses a C-compiled function `scipy.cdist` to do the distance calculation, while I use uncompiled `numpy` to compare the square of the distances. I'm not sure if it's faster to use the compiled code or to avoid doing the square root. – Daniel F Apr 04 '17 at 13:02
  • Aaand a little `%timeit` testing and @Divakar is faster by a factor of 3. – Daniel F Apr 04 '17 at 13:08
2

Here's a vectorized approach using Scipy's cdist -

from scipy.spatial import distance

# Get eucliden distances between first three cols off data_CD8 and data_CD4
dists = distance.cdist(data_CD8[:,:3], data_CD4[:,:3])

# Get mask of those distances that are within co distance. This sets up the 
# first condition requirement as posted in the loopy version of original code.
mask1 = dists < co

# Take the third column off the two input arrays that represent the time values.
# Get the equality between all time values off data_CD8 against all time values
# off data_CD4. This sets up the second conditional requirement.
# We are adding a new axis with None, so that NumPY broadcasting
# would let us do these comparisons in a vectorized manner.
mask2 = data_CD8[:,3,None] == data_CD4[:,3]

# Combine those two masks and look for any match correponding to any 
# element off data_CD4. Since the masks are setup such that second axis
# represents data_CD4, we need numpy.any along axis=1 on the combined mask.
# A final inversion of mask is needed as we are deleting the ones that 
# satisfy these requirements.
mask3 = ~((mask1 & mask2).any(1))

# Finally, using boolean indexing to select the valid rows off data_CD8
out = data_CD8[mask3]
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Hmm, when trying your code nothing is deleted from array. With my code half of the elemts in data_CD8 is deleted. For now i can not say why. – Varlor Apr 04 '17 at 12:37
  • @Varlor It creates a new array with the deletions as `data_CD8_out`. Did you verify values in that array? Or just assign back with `data_CD8 = data_CD8[~((mask1 & mask2).any(1))]`? – Divakar Apr 04 '17 at 12:38
  • So data_CD8_out is the origanial array without the elements that do not fulfill the condition? Can you mabye explain your code. It seems really fast and i want to understand it :) – Varlor Apr 04 '17 at 12:46
  • @Varlor Added few explanations there. – Divakar Apr 04 '17 at 12:55
  • Hey thank you for the explanations. But what i dont get is adding a new axe throug "None" in the step of creating mask 2. I compare the 4th column of data_CD4 and data_CD8. If i dont add a new axes with None i get what seems like a list of booleans, but it is not a list. What happens here? – Varlor Apr 04 '17 at 14:17
  • @Varlor I am not sure why we are talking about lists. Both of `data_CD8` and `data_CD4` are NumPy arrays, right? – Divakar Apr 04 '17 at 14:19
  • 1
    @Varlor To answer your question though, with that `None`, we are extending the fourth column slice of `data_CD8` from 1D to 2D. So, when that extended version is compared against `data_CD4` it does comparison of all elems of that fourth col of `data_CD8` against all elems of `data_CD4`, giving us a 2D array of those comparisons. Without using `None`, we won't be extending `data_CD8` and it will simply do elementwise comparison between the two arrays and not `all elemens against all elems` comparison and that's not the expected operation. Hope that makes sense. – Divakar Apr 04 '17 at 14:24
  • How would i do this if i only wnat to calculate the distances to the coordinate (0,0,0) and only get those which have a higher distance than a given value and a lower distance than another value. Could i do this with 2 masks again? But which function would i use to calculate the distances? cdists wouldnt be suitable since it is only for 2 arrays. Best regard! – Varlor Apr 05 '17 at 10:57
  • @Varlor Please post a new question on it. Please add details into the new question on how it relates to this question. – Divakar Apr 05 '17 at 10:59
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/139963/discussion-between-varlor-and-divakar). – Varlor Apr 05 '17 at 11:42
  • Hey :), Is there a simple method to show those coordinates which were excluded from CD8 data? Best regards – Varlor Jun 01 '17 at 14:21
0

if you have to compare all items in data_CD4 to the items in data_CD8 while removing data from data_CD8 it might be better to make the second iterable smaller on each iteration, which of course depends on your most common case.

for m in data_CD4:
    for i in data_CD8:
        if distance.euclidean(tuple(i[3:]),tuple(m[3:])) < co and i[3]==m[3] :
            data_CD8 = np.delete(data_CD8, i, 0)

Based on big O notation - and since this is O(n^2) - i don't see a faster solution.

widerin
  • 86
  • 1
  • 4