Python: Compare Elements of two arrays

Question

I want to compare elements of two numpy arrays and delete the elements of one of those arrays if the eucledean distance between the coordinates is smaller than 1 and the time is the same. data_CD4 and data_CD8 are the arrays. The elements of the arrays are lists with 3D Coordinates and the time as 4th element (numpy.array([[x,y,z,time],[x,y,z,time].....]). Co is the Cutoff, here 1.

for i in data_CD8:
        for m in data_CD4:
            if distance.euclidean(tuple(i[:3]),tuple(m[:3])) < co and i[3]==m[3] :
                data_CD8=np.delete(data_CD8, i, 0)

Is there a faster approach to do that? The first array has 5000 elements, the second 2000, so it tooks too much time.

If you want you can use numpy also to compare, check this out: http://stackoverflow.com/questions/10580676/comparing-two-numpy-arrays-for-equality-element-wise — danglingpointer, Apr 04 '17 at 12:00
As @trincot pointed out that it must be `distance.euclidean(tuple(i[:3]),tuple(m[:3])) `. Could you confirm that? — Divakar, Apr 04 '17 at 12:41

Daniel F · Answer 1 · 2017-04-04T13:11:37.750

2

This should be a vectorized method.

mask1 = np.sum((data_CD4[:, None, :3] - data_CD8[None, :, :3])**2, axis = -1) < co**2
mask2 = data_CD4[:, None, 3] == data_CD8[None, :, 3]
mask3 = np.any(np.logical_and(mask1, mask2), axis = 0)
data_CD8 = data_CD8[~mask3]

mask1 should speed the distance calculations up as it doesn't require a square root call. mask1 and mask2 are 2-D arrays which we squeeze to 1d by np.any Doing all the deleting at the end prevents the pile of read/writes.

Speed testing:

a = np.random.randint(0, 10, (100, 3))

b = np.random.randint(0, 10, (100, 3))

%timeit cdist(a,b) < 5  #Divakar's answer
10000 loops, best of 3: 133 µs per loop

%timeit np.sum((a[None, :, :] - b[:, None, :]) ** 2, axis = -1) < 25  # My answer
1000 loops, best of 3: 418 µs per loop

And the C-compiled code wins, even when adding an unnecessary square root.

edited Apr 04 '17 at 13:11

answered Apr 04 '17 at 12:14

Daniel F

13,620
2
29
55

Thanks for the effort. I get this error when trying the code: IndexError: index 3284 is out of bounds for axis 0 with size 2587 – Varlor Apr 04 '17 at 12:36
Hard to tell what the error is without the line, but try `axis = 0` in `mask3` – Daniel F Apr 04 '17 at 12:40
Aaand as in Divakar's answer you need to invert `mask3` – Daniel F Apr 04 '17 at 12:49
Ah nice! Yeah now both of your codes have the same results. Can you maybe explain what you code does? I really want to understand it :) – Varlor Apr 04 '17 at 12:52
If the original arrays have dimensions `(n,4)` and `(m,4)` in both cases `mask1` and `mask2` are `(n,m)` boolean arrays comparing every 'row' of one array to every 'row' of the other. Then you use `np.any` to look for any `True` values for any of the `data_CD8` 'rows'. since you want to get rid of those rows, you invert the `np.any` output and index the original data against it to get your output. – Daniel F Apr 04 '17 at 13:00
The difference between @Divakar and my answer is he uses a C-compiled function `scipy.cdist` to do the distance calculation, while I use uncompiled `numpy` to compare the square of the distances. I'm not sure if it's faster to use the compiled code or to avoid doing the square root. – Daniel F Apr 04 '17 at 13:02
Aaand a little `%timeit` testing and @Divakar is faster by a factor of 3. – Daniel F Apr 04 '17 at 13:08

Divakar · Accepted Answer · 2017-04-04T12:55:48.063

2

Here's a vectorized approach using Scipy's cdist -

from scipy.spatial import distance

# Get eucliden distances between first three cols off data_CD8 and data_CD4
dists = distance.cdist(data_CD8[:,:3], data_CD4[:,:3])

# Get mask of those distances that are within co distance. This sets up the 
# first condition requirement as posted in the loopy version of original code.
mask1 = dists < co

# Take the third column off the two input arrays that represent the time values.
# Get the equality between all time values off data_CD8 against all time values
# off data_CD4. This sets up the second conditional requirement.
# We are adding a new axis with None, so that NumPY broadcasting
# would let us do these comparisons in a vectorized manner.
mask2 = data_CD8[:,3,None] == data_CD4[:,3]

# Combine those two masks and look for any match correponding to any 
# element off data_CD4. Since the masks are setup such that second axis
# represents data_CD4, we need numpy.any along axis=1 on the combined mask.
# A final inversion of mask is needed as we are deleting the ones that 
# satisfy these requirements.
mask3 = ~((mask1 & mask2).any(1))

# Finally, using boolean indexing to select the valid rows off data_CD8
out = data_CD8[mask3]

edited Apr 04 '17 at 12:55

answered Apr 04 '17 at 12:25

Divakar

218,885
19
262
358

Hmm, when trying your code nothing is deleted from array. With my code half of the elemts in data_CD8 is deleted. For now i can not say why. – Varlor Apr 04 '17 at 12:37
@Varlor It creates a new array with the deletions as `data_CD8_out`. Did you verify values in that array? Or just assign back with `data_CD8 = data_CD8[~((mask1 & mask2).any(1))]`? – Divakar Apr 04 '17 at 12:38
So data_CD8_out is the origanial array without the elements that do not fulfill the condition? Can you mabye explain your code. It seems really fast and i want to understand it :) – Varlor Apr 04 '17 at 12:46
@Varlor Added few explanations there. – Divakar Apr 04 '17 at 12:55
Hey thank you for the explanations. But what i dont get is adding a new axe throug "None" in the step of creating mask 2. I compare the 4th column of data_CD4 and data_CD8. If i dont add a new axes with None i get what seems like a list of booleans, but it is not a list. What happens here? – Varlor Apr 04 '17 at 14:17
@Varlor I am not sure why we are talking about lists. Both of `data_CD8` and `data_CD4` are NumPy arrays, right? – Divakar Apr 04 '17 at 14:19
1

@Varlor To answer your question though, with that `None`, we are extending the fourth column slice of `data_CD8` from 1D to 2D. So, when that extended version is compared against `data_CD4` it does comparison of all elems of that fourth col of `data_CD8` against all elems of `data_CD4`, giving us a 2D array of those comparisons. Without using `None`, we won't be extending `data_CD8` and it will simply do elementwise comparison between the two arrays and not `all elemens against all elems` comparison and that's not the expected operation. Hope that makes sense. – Divakar Apr 04 '17 at 14:24
How would i do this if i only wnat to calculate the distances to the coordinate (0,0,0) and only get those which have a higher distance than a given value and a lower distance than another value. Could i do this with 2 masks again? But which function would i use to calculate the distances? cdists wouldnt be suitable since it is only for 2 arrays. Best regard! – Varlor Apr 05 '17 at 10:57
@Varlor Please post a new question on it. Please add details into the new question on how it relates to this question. – Divakar Apr 05 '17 at 10:59
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/139963/discussion-between-varlor-and-divakar). – Varlor Apr 05 '17 at 11:42
Hey :), Is there a simple method to show those coordinates which were excluded from CD8 data? Best regards – Varlor Jun 01 '17 at 14:21

score 0 · Answer 3 · answered Apr 04 '17 at 12:02

if you have to compare all items in data_CD4 to the items in data_CD8 while removing data from data_CD8 it might be better to make the second iterable smaller on each iteration, which of course depends on your most common case.

for m in data_CD4:
    for i in data_CD8:
        if distance.euclidean(tuple(i[3:]),tuple(m[3:])) < co and i[3]==m[3] :
            data_CD8 = np.delete(data_CD8, i, 0)

Based on big O notation - and since this is O(n^2) - i don't see a faster solution.

Python: Compare Elements of two arrays

3 Answers3