I am looking for a neat way to create a boolean mask for array indexing. I have two vectors: one holds the values I am interested in, and the other holds the data itself. I tried to get this working as follows:

dataINeed = np.arange(3)
# array([0, 1, 2])

data = np.random.randint(10, size=10)
# array([5, 7, 9, 1, 5, 3, 7, 1, 2, 0])

mask = data in dataINeed
# desired result: array([False, False, False, True, False, False, False, True, True, True])

I know this might be achievable with set operations, but I could not figure out the recipe to get such a result. Any help on this?

JustInTime

2 Answers

Could something like this be good?

>>> import numpy as np
>>> dataINeed = np.arange(3)
>>> dataINeed
array([0, 1, 2])
>>> data = np.array([5,7,9,1,5,3,7,1,2,0])
>>> dataINeedset = set(dataINeed)
>>> np.array([x in dataINeedset for x in data])
array([False, False, False,  True, False, False, False,  True,  True,  True], dtype=bool)
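
For comparison, here is a minimal vectorized sketch of the same mask. It assumes NumPy 1.13 or later, where `np.isin` is available, and is not one of the approaches timed in this thread:

```python
import numpy as np

# Values of interest and the data to test, as in the question.
dataINeed = np.arange(3)
data = np.array([5, 7, 9, 1, 5, 3, 7, 1, 2, 0])

# np.isin tests each element of `data` for membership in `dataINeed`
# and returns a boolean array with the same shape as `data`.
mask = np.isin(data, dataINeed)

# The mask can then be used directly for boolean indexing.
selected = data[mask]
```

With these inputs, `mask` should match the array printed above, and `selected` holds only the entries of `data` that appear in `dataINeed`.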
Roman Susi
  • Why use `set` for something like that? You are not using any set operations... – joaquin Jan 08 '12 at 13:13
  • I am using "in" to check whether an element is in the set, which is what a set is best at: it's an O(1) operation. If the size of dataINeed is always small (say, under 5), then `x in dataINeed` is enough. – Roman Susi Jan 08 '12 at 13:33
Roman Susi's solution is very fast (compared to the ideas I came up with).

Here are a few benchmarks against those other methods:

With this setup:

import numpy as np

N = 10000
m = 3000
dataINeed = np.arange(m)
data = np.random.randint(N,size = (N))

In [76]: %timeit dataINeedset = set(dataINeed); np.fromiter((x in dataINeedset for x in data),dtype = bool, count = -1)
100 loops, best of 3: 4.46 ms per loop

In [61]: %timeit ~np.prod(np.subtract.outer(data,dataINeed).astype('bool'),axis=-1,dtype='bool')
1 loops, best of 3: 335 ms per loop (Roman's solution is 75x faster than mine!)

In [54]: %timeit np.logical_or.reduce([(data == x) for x in dataINeed])
1 loops, best of 3: 1.72 s per loop  (Roman's solution is 386x faster)
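
Before comparing timings, it helps to confirm that the approaches agree. The sketch below is a sanity check under stated assumptions: it uses smaller sizes than the benchmark, a seeded generator (`rng`, not in the original setup), and a plain broadcasting comparison standing in for the `np.subtract.outer` expression.

```python
import numpy as np

# Smaller sizes than the benchmark so the check runs quickly (an assumption).
N = 1000
m = 300
rng = np.random.default_rng(0)
dataINeed = np.arange(m)
data = rng.integers(N, size=N)

# 1. Set membership, one Python-level lookup per element.
dataINeedset = set(dataINeed)
mask_set = np.fromiter((x in dataINeedset for x in data), dtype=bool, count=-1)

# 2. Broadcasting: compare every element of data with every wanted value,
#    which builds an (N, m) intermediate array.
mask_bcast = (data[:, None] == dataINeed[None, :]).any(axis=1)

# 3. One equality test per wanted value, OR-ed together.
mask_reduce = np.logical_or.reduce([data == x for x in dataINeed])

# All three should produce the same boolean mask.
agree = np.array_equal(mask_set, mask_bcast) and np.array_equal(mask_set, mask_reduce)
```

The second and third variants do work proportional to N times m, while the set version does roughly one hash lookup per element, which is consistent with the gap in the timings above.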
unutbu
  • Thanks. I think it can be improved a little bit by using numpy.fromiter((x in dataINeedset for x in data), bool, count=-1), but it's mostly about memory, not speed. Hint found here: http://stackoverflow.com/questions/367565/how-do-i-build-a-numpy-array-from-a-generator – Roman Susi Jan 08 '12 at 13:50
  • Yes. Updated to use `np.fromiter`. – unutbu Jan 08 '12 at 14:08