7

Suppose I have an example of numpy array:

import numpy as np
X = np.array([2,5,0,4,3,1])

And I also have a list of arrays, like:

A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]

I want to leave only these items of each list that are also in X. I expect also to do it in a most efficient/common way.

Solution I have tried so far:

  1. Sort X using X.sort().
  2. Find locations of items of each array in X using:

    locations = [np.searchsorted(X, n) for n in A]
    
  3. Leave only proper ones:

    masks = [X[locations[i]] == A[i] for i in range(len(A))]
    result = [A[i][masks[i]] for i in range(len(A))]
    

But it doesn't work because locations of third array is out of bounds:

locations = [array([0, 0, 2], dtype=int64), array([0, 1, 2, 3, 4, 5], dtype=int64), array([2, 5, 4, 6], dtype=int64)]

How to solve this issue?

Update

I ended up with idx[idx==len(Xs)] = 0 solution. I've also noticed two different approaches posted between the answers: transforming X into set vs np.sort. Both of them has plusses and minuses: set operations uses iterations which is quite slow in compare with numpy methods; however np.searchsorted speed increases logarithmically unlike acceses of set items which is instant. That why I decided to compare performance using data with huge sizes, especially 1 million items for X, A[0], A[1], A[2].

enter image description here

enter image description here

mathfux
  • 5,759
  • 1
  • 14
  • 34

2 Answers2

2

How about this, very simple and efficent:

import numpy as np
X = np.array([2,5,0,4,3,1])
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]

X_set = set(X)
A = [np.array([a for a in arr if a in X_set]) for arr in A]
#[array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]

According to the docs, set operations all have O(1) complexity, therefore the overall is O(N)

ExplodingGayFish
  • 2,807
  • 1
  • 5
  • 14
2

One idea would be less compute and minimal work when looping. So, here's one with those in mind -

a = np.concatenate(A)
m = np.isin(a,X)
l = np.array(list(map(len,A)))
a_m = a[m]
cut_idx = np.r_[0,l.cumsum()]
l_m = np.add.reduceat(m,cut_idx[:-1])
cl_m = np.r_[0,l_m.cumsum()]
out = [a_m[i:j] for (i,j) in zip(cl_m[:-1],cl_m[1:])]

Alternative #1 :

We can also use np.searchsorted to get the isin mask, like so -

Xs = np.sort(X)
idx = np.searchsorted(Xs,a)
idx[idx==len(Xs)] = 0
m = Xs[idx]==a

Another way with np.intersect1d

If you are looking for the most common/elegant one, think it would be with np.intersect1d -

In [43]: [np.intersect1d(X,A_i) for A_i in A]
Out[43]: [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 4, 5])]

Solving your issue

You can also solve your out-of-bounds issue, with a simple fix -

for l in locations:
    l[l==len(X)]=0
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • I'm sure using `intersect1d` is a good expression of my idea but I work with pretty huge `X` and `A` so that's why I have decided to sort `X` first. – mathfux Jan 09 '20 at 07:53
  • 2
    @mathfux I think you'll find sorting very large arrays is much slower than using `set` functions. – Daniel F Jan 09 '20 at 08:06