20

After many attempts at optimizing this code, it seems the last resort is to run it on multiple cores. I don't know exactly how to convert/restructure my code so that it runs much faster using multiple cores. I would appreciate any guidance toward the end goal: running this code as fast as possible for arrays A and B where each array holds about 700,000 elements. Here is the code using small arrays; the 700k-element arrays are commented out.

import numpy as np

def ismember(a, b):
    for i in a:
        index = np.where(b == i)[0]
        if index.size == 0:
            yield 0
        else:
            yield index[0]  # lowest index in b where b == i


def f(A, gen_obj):
    my_array = np.arange(len(A))
    for i in my_array:
        my_array[i] = next(gen_obj)
    return my_array


#A = np.arange(700000)
#B = np.arange(700000)
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

gen_obj = ismember(A,B)

f(A, gen_obj)

print('done')
# if we print f(A, gen_obj) the output will be: [4 0 0 4 3]
# notice that the output array needs to be kept the same size as array A.

What I am trying to do is mimic a MATLAB function called ismember (the one with the signature [Lia,Locb] = ismember(A,B)). I am only trying to get the Locb part.

From Matlab: Locb contains the lowest index in B for each value in A that is a member of B. The output array, Locb, contains 0 wherever A is not a member of B.

One of the main problems is that I need to be able to perform this operation as efficiently as possible. For testing I have two arrays of 700k elements. Creating a generator and iterating through its values doesn't seem to get the job done fast.

zd5151

5 Answers

18

Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:

def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

Your original implementation requires a full scan of B for each element of A, making it O(len(A)*len(B)). The code above requires one full scan of B to build the dict bind. Because dict lookups take constant time on average, each element of A is then checked in O(1), making the whole operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.
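For the question's sample arrays, this version returns None (rather than 0) for the values of A missing from B; a quick sanity check:

```python
def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i  # keep the first (lowest) index for each value
    return [bind.get(itm, None) for itm in a]

A = [3, 4, 4, 3, 6]
B = [2, 5, 2, 6, 3]
print(ismember(A, B))  # [4, None, None, 4, 3]
```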

Edit: I've also modified your indexing slightly. Matlab can use 0 as the "not found" marker because all of its arrays start at index 1. Python/numpy arrays start at 0, so if your data set looks like this

A = [2378, 2378, 2378, 2378]
B = [2378, 2379]

and you return 0 for "no element", then your results will exclude all elements of A. The routine above returns None for a missing index instead of 0. Returning -1 is an option, but Python will interpret that as the last element of the array, whereas None will raise an exception if used as an index. If you'd like different behavior, change the second argument in the bind.get(itm, None) expression to the value you want returned.
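For illustration (using a plain Python list; note that NumPy arrays treat None as np.newaxis instead of raising):

```python
B = [2, 5, 2, 6, 3]

# -1 silently returns the last element instead of signaling "not found"
print(B[-1])  # 3

# None fails loudly when used as a list index
try:
    B[None]
except TypeError as exc:
    print("None is not a valid index:", exc)
```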

Eric O. Lebigot
sfstewman
  • Wow this is really fast ! You have no idea how much I appreciate your solution. Thank you very much ! Do you use a particular tool to output a performance profile ? – zd5151 Apr 07 '13 at 16:18
  • 7
    @z5151 No, it's straightforward algorithmic analysis. Using [Big-O notation](http://en.wikipedia.org/wiki/Big_O_notation): `np.where` has to perform a linear scan of `B`, which requires `O(len(B))` operations. You then use an outer loop that requires `O(len(A))` operations, making your original algorithm roughly `O(len(A)*len(B))` operations. Generating `Bind` requires `len(B)` operations. Dictionaries are implemented as [hash tables](http://en.wikipedia.org/wiki/Hash_table), which have constant `O(1)` lookup, so scanning A is `O(len(A))`; the overall complexity is `O(len(A)+len(B))`. – sfstewman Apr 07 '13 at 16:28
  • Got it. Thanks for the wikipedia reference. – zd5151 Apr 07 '13 at 17:24
  • @z5151 I've changed the answer to try and alleviate a possible problem with your indexing. Matlab uses 0 because all of its arrays start at index 1. Python/numpy start arrays at 0, so an index of zero is a valid element. The `ismember` function above now returns `None` if the element does not exist in the list. – sfstewman Apr 07 '13 at 18:02
  • Thanks. I appreciate your emphasis on comparing both Matlab and Python. – zd5151 Apr 08 '13 at 02:22
  • I vastly simplified the code. It should also run faster, as a consequence. I also followed the PEP 8 style recommandations. – Eric O. Lebigot Apr 08 '13 at 04:03
  • 1
    @EOL No, you broke the code. The element returned is now the **last** occurrence in the list, not the first. There's a reason that I didn't use a dictionary comprehension in the original code. – sfstewman Apr 08 '13 at 15:04
  • 1
    @EOL As far as I can tell, you can use a dictionary comprehension by either iterating over the reversed range: `{ B[i] : i for i in xrange(len(B)-1,-1,-1) }`. Or by using a reverse iterator: `{ elt : len(B)-i-1) for (i,elt) in enumerate(reversed(B)) }` Neither one is pretty (or simple). The first assumes that B is indexable, and the second that B is reversible. They also assume that random index/reverse iteration are cheap. If B is a very large linked list, your performance will be abysmal. Is there a method of using a comprehension to retrieve the first index that only assumes iteration? – sfstewman Apr 08 '13 at 15:25
  • @sfstewman: You are right, about the *first* occurrence being required (I thought it did not matter). Since in the question `B` is a NumPy array, both of the dictionary comprehensions that you give work. `reversed()` is even efficient since it gives an iterator. Timing the three approaches would tell us which one is the fastest (`len(B)-1` should probably be precomputed, in the `reversed()` approach). – Eric O. Lebigot Apr 09 '13 at 02:14
  • PS: I did modify your code and applied PEP 8 guidelines, though. I also simplified the `if bind.get(elt) is None` as the standard, more explicit and faster `if elt not in bind` (which does not force the reader to know that `None` is not a possible dictionary value). – Eric O. Lebigot Apr 09 '13 at 02:21
  • Amazing... this should be part of the numpy implementation! – Vivek Subramanian May 28 '19 at 14:08
15

sfstewman's excellent answer most likely solved the issue for you.

I'd just like to add how you can achieve the same exclusively in numpy.

I make use of numpy's unique and in1d functions.

B_unique_sorted, B_idx = np.unique(B, return_index=True)
B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
  • B_unique_sorted contains the unique values in B sorted.
  • B_idx holds for these values the indices into the original B.
  • B_in_A_bool is a boolean array the size of B_unique_sorted that stores whether a value in B_unique_sorted is in A.
    Note: I need to look for (unique vals from B) in A because I need the output to be returned with respect to B_idx
    Note: I assume that A is already unique.

Now you can use B_in_A_bool to either get the common vals

B_unique_sorted[B_in_A_bool]

and their respective indices in the original B

B_idx[B_in_A_bool]

Finally, I assume that this is significantly faster than the pure Python for-loop although I didn't test it.
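As a sketch (not part of the original answer), these pieces can be combined with np.searchsorted to build the full Locb-style output aligned with A, using -1 to mark "not in B"; np.unique with return_index=True already gives first-occurrence (lowest) indices:

```python
import numpy as np

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])

B_unique_sorted, B_idx = np.unique(B, return_index=True)

# Locate each element of A within the sorted unique values of B
pos = np.searchsorted(B_unique_sorted, A)
pos[pos == len(B_unique_sorted)] = 0        # clamp out-of-range positions
found = B_unique_sorted[pos] == A           # True where A's value occurs in B
Locb = np.where(found, B_idx[pos], -1)      # -1 marks "not in B"
print(Locb)  # [ 4 -1 -1  4  3]
```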

tzelleke
  • +1 for just using numpy wherever possible, major speedups can be achieved this way (as I have learned the hard way >_<) – James Porter Jul 07 '13 at 01:45
  • 2
    Watch out! This does not preserve the order of the indices! Try it with range(1,6) and [5,1]. In case the order of the indices is not needed, I think you can just use np.in1d() and then np.nonzero()[0] – aless80 Jan 26 '18 at 02:55
  • 1
    See the answer here: https://stackoverflow.com/questions/33678543/finding-indices-of-matches-of-one-array-in-another-array for getting indices in the right order – aless80 Jan 26 '18 at 03:01
2

Try the ismember library.

pip install ismember

Simple example:

# Import library
from ismember import ismember
import numpy as np

# data
A = np.array([3,4,4,3,6])
B = np.array([2,5,2,6,3])

# Lookup
Iloc,idx = ismember(A, B)
 
# Iloc is a boolean array marking which elements of A occur in B
print(Iloc)
# [ True False False  True  True]

# indices into B for the elements of A that occur in B
print(idx)
# [4 4 3]

print(B[idx])
# [3 3 6]

print(A[Iloc])
# [3 3 6]

# These vectors will match
A[Iloc]==B[idx]

Speed check (the dict-based function from the accepted answer, here called ismember_stack, is defined first so the script runs top to bottom):

from ismember import ismember
from datetime import datetime
import numpy as np

def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

t1 = []
t2 = []
# Create some random vectors
ns = np.random.randint(10, 10000, 1000)

for n in ns:
    a_vec = np.random.randint(0, 100, n)
    b_vec = np.random.randint(0, 100, n)

    # Run stack version
    start = datetime.now()
    out1 = ismember_stack(a_vec, b_vec)
    end = datetime.now()
    t1.append(end - start)

    # Run ismember
    start = datetime.now()
    out2 = ismember(a_vec, b_vec)
    end = datetime.now()
    t2.append(end - start)

print(np.sum(t1))
# 0:00:07.778331

print(np.sum(t2))
# 0:00:04.609801

The ismember function from pypi is almost 2x faster.

Large vectors, eg 700000 elements:

from ismember import ismember
from datetime import datetime
import numpy as np

A = np.random.randint(0,100,700000)
B = np.random.randint(0,100,700000)

# Lookup
start = datetime.now()
Iloc,idx = ismember(A, B)
end = datetime.now()

# Print time
print(end-start)
# 0:00:01.194801
erdogant
1

Try using a list comprehension;

In [1]: import numpy as np

In [2]: A = np.array([3,4,4,3,6])

In [3]: B = np.array([2,5,2,6,3])

In [4]: [x for x in A if x not in B]
Out[4]: [4, 4]

List comprehensions are generally faster than equivalent for-loops.

To get an equal length-list;

In [19]: list(map(lambda x: x if x not in B else False, A))
Out[19]: [False, 4, 4, False, False]

This is quite fast for small datasets:

In [20]: C = np.arange(10000)

In [21]: D = np.arange(15000, 25000)

In [22]: %timeit list(map(lambda x: x if x not in D else False, C))
1 loops, best of 3: 756 ms per loop

For large datasets, you could try using a multiprocessing.Pool.map() to speed up the operation.
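A hedged sketch of what that could look like (the chunking scheme and the helper name lookup_chunk are illustrative, not from the answer); B is converted to a set so each worker avoids linear scans:

```python
import numpy as np
from multiprocessing import Pool

def lookup_chunk(args):
    chunk, b_set = args
    # Same element-wise test as the lambda above, applied to one slice of A
    return [x if x not in b_set else False for x in chunk]

if __name__ == '__main__':
    A = np.arange(10000)
    B = set(range(15000, 25000))
    chunks = np.array_split(A, 4)  # split A across 4 workers
    with Pool(4) as pool:
        parts = pool.map(lookup_chunk, [(c, B) for c in chunks])
    result = [x for part in parts for x in part]  # reassemble in order
```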

Roland Smith
  • The output array needs to be kept the same size. – zd5151 Apr 07 '13 at 15:59
  • @z5151: See enhanced answer. You can change the `lambda` expression to return 0 instead of False if you want, but that would mask real 0's in the results. – Roland Smith Apr 07 '13 at 16:05
  • This is useful for arrays with small number of elements. Thank you for emphasizing that list comprehensions are much faster than loops. – zd5151 Apr 07 '13 at 16:15
  • Your answer returns the elements, not the index of the elements in B. – sfstewman Apr 07 '13 at 16:34
1

Here is the exact MATLAB equivalent that returns both output arguments [Lia, Locb], matching MATLAB except that in Python 0 is also a valid index. So this function doesn't return the 0s; it essentially returns Locb(Locb>0). The performance is also equivalent to MATLAB.

def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """

    bool_ind = np.isin(a_vec,b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv  = np.unique(common, return_inverse=True)     # common = common_unique[common_inv]
    b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]
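A quick check with the question's sample arrays (a usage example of my own, with the function repeated so the snippet is self-contained):

```python
import numpy as np

def ismember(a_vec, b_vec):
    """ MATLAB equivalent ismember function """
    bool_ind = np.isin(a_vec, b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv = np.unique(common, return_inverse=True)
    b_unique, b_ind = np.unique(b_vec, return_index=True)
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])
Lia, Locb = ismember(A, B)
print(Lia)       # [ True False False  True  True]
print(Locb)      # [4 4 3]
print(B[Locb])   # [3 3 6] -- matches A[Lia]
```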

An alternate implementation that is somewhat (~5x) slower but doesn't use the unique function:

def ismember(a_vec, b_vec):
    ''' MATLAB equivalent ismember function. Slower than the implementation above. '''
    # Build the dict in reverse so the lowest index wins for duplicate values in b_vec
    b_dict = {b_vec[i]: i for i in range(len(b_vec) - 1, -1, -1)}
    indices = [b_dict.get(x) for x in a_vec if b_dict.get(x) is not None]
    booleans = np.in1d(a_vec, b_vec)
    return booleans, np.array(indices, dtype=int)