Python: how to avoid loop?

Question

I have a list of entries

l = [5, 3, 8, 12, 24]

and a matrix M

M:
12  34  5  8  7
0   24   12 3  1

I want to find the indeces of the matrix where appear the numbers in l. For the k-entry of l I want to save a random couple of indices i, j where M[i][j]==l[k]. I am doing the following

indI = []
indJ = []
for i in l:
    tmp = np.where(M == i)
    rd = randint(len(tmp))
    indI.append(tmp[0][rd])
    indJ.append(tmp[1][rd])

I would like to see if there is a way to avoid that loop

*Why* are you trying to avoid the loop? Does the loop not work? Does it take too long? Do you just not like how it looks? — Rob Watts, Mar 10 '17 at 20:23
@RobWatts: The loop works but it takes too long because of the size of `l` and `M` — emax, Mar 10 '17 at 20:25
[Disassembly](http://stackoverflow.com/a/869600/5994041) the code to check [what you can "skip"](http://stackoverflow.com/a/869347/5994041) e.g. with other type of loop, list comprehension, or something else from pure Python. If that doesn't help, go for Cython. — Peter Badida, Mar 10 '17 at 20:31
This fails if the given element of `l` is not in `M` (which happens for instance for 24, in the example given). — fuglede, Mar 10 '17 at 20:32
What scale are we talking about for each of `l` and `M`? Thousands of elements? Millions of elements? — Rob Watts, Mar 10 '17 at 20:37
@RobWatts `l` is 2 million while `M = 2500x2500`. @fugiede I corrected the question — emax, Mar 10 '17 at 20:45
Still doesn't run, because `len(tmp)` is always `2`, whereas tmp[0] or tmp[1] might not be of that length. — Divakar, Mar 10 '17 at 20:55

score 0 · Answer 1 · answered Mar 10 '17 at 21:01

One way in which you should be able to significantly speed up your code is to avoid duplicate work:

tmp = np.where(M == i)

As this gives you a list of all locations in M where the value is equal to i, it must be searching through the entire matrix. So for each element in l, you are searching through the full matrix.

Instead of doing that, try indexing your matrix as a first step:

matrix_index = {}
for i in len(M):
    for j in len(M[i]):
        if M[i][j] not in matrix_index:
            matrix_index[M[i][j]] = [(i,j)]
        else:
            matrix_index[M[i][j]].append((i,j))

Then for each value in l, instead of doing a costly search through the full matrix, you can just get it straight from your matrix index.

_{Note: I haven't with numpy very much, so I may have gotten the specific syntax incorrect. There may also be a more idiomatic way of doing this in numpy.}

fuglede · Answer 2 · 2017-03-10T22:49:43.600

One solution that does not use the word for is

c = np.apply_along_axis(lambda row: np.random.choice(np.argwhere(row).ravel()), 1, M.ravel()[np.newaxis, :] == l[:, np.newaxis])
indI, indJ = c // M.shape[1], c % M.shape[1]

Note that while that solves the problem, M.ravel()[np.newaxis, :] == l[:, np.newaxis] will quickly produce MemoryErrors. A more pragmatic approach would be to get the indices of interest through something like

s = np.argwhere(M.ravel()[np.newaxis, :] == l[:, np.newaxis])

and then do the random choice post-processing by hand. This, however, probably does not yield any significant performance improvements over your search.

What makes it slow, though, is that you search through the entire matrix in every step of your loop; by pre-sorting the matrix (at a certain cost) gives you a straightforward way of making each individual search much faster:

In [312]: %paste
def direct_search(M, l):
    indI = []
    indJ = []
    for i in l:
        tmp = np.where(M == i)
        rd = np.random.randint(len(tmp[0]))  # Note the fix here
        indI.append(tmp[0][rd])
        indJ.append(tmp[1][rd])
    return indI, indJ

def using_presorted(M, l):
    a = np.argsort(M.ravel())
    M_sorted = M.ravel()[a]
    def find_indices(i):
        s = np.searchsorted(M_sorted, i)
        j = 0
        while M_sorted[s + j] == i:
            yield a[s + j]
            j += 1
    indices = [list(find_indices(i)) for i in l]
    c = np.array([np.random.choice(i) for i in indices])
    return c // M.shape[1], c % M.shape[1]

## -- End pasted text --

In [313]: M = np.random.randint(0, 1000000, (1000, 1000))

In [314]: l = np.random.choice(M.ravel(), 1000)

In [315]: %timeit direct_search(M, l)
1 loop, best of 3: 4.76 s per loop

In [316]: %timeit using_presorted(M, l)
1 loop, best of 3: 208 ms per loop

In [317]: indI, indJ = using_presorted(M, l)  # Let us check that it actually works

In [318]: np.all(M[indI, indJ] == l)
Out[318]: True

Murray Lee · Answer 3 · 2017-03-10T21:48:48.617

If both l and M are not large matrices like the following:

    In: l0 = [5, 3, 8, 12, 34, 1, 12]
    In: M0 = [[12, 34,  5,  8,  7],
    In:       [ 0, 24, 12,  3,  1]]

    In: l = np.asarray(l)
    In: M = np.asarray(M)

You can try this:

    In: np.where(l[None, None, :] == M[:, :, None])

    Out:
        (array([0, 0, 0, 0, 0, 1, 1, 1, 1]),   <- i
         array([0, 0, 1, 2, 3, 2, 2, 3, 4]),   <- j
         array([3, 6, 4, 0, 2, 3, 6, 1, 5]))   <- k

The rows should be the i, j, k, respectively and read the column to get every (i, j, k) you need. For example, the 1st column [0, 0, 3] means M[0, 0] = l[3], and the 2nd column [0, 0, 6] says M[0, 0] = l[6], and vice versa. I think these are what you want.

However, the numpy trick can not be extended to very large matrices, such as 2M elements in l or 2500x2500 elements in M. They need quite a lot memory and very very long time to compute... if they are lucky not to crash for out of memory. :)

Python: how to avoid loop?

3 Answers3