Hybridizing Numpy vectorization and plain Python

Question

I have a 2-dimensional ndarray, the rows of which shall be scanned to check whether any one is equal to any other.

My first try actually works, but I feel it is not the optimal way. It takes time once the number of rows in the matrix approaches 1000.

My code is the following. X is the aforementioned array, Y also is a 2-dimensional ndarray.

for i in range(X.shape[0]-1):
    for j in range(i+1,X.shape[0]):
        if (np.all( (X[i,:] == X[j,:] ), axis = 0 )):
            Y[j,:] = Y[i,:]
        #endif
    #enddo
#enddo

I know that the nested loop is time consuming and should be avoided, but I could not find an alternative. List comprehension seems to me not suitable in that there is no need to save items.

The fact that the core of the procedure is the assignment operation Y[j,:] = Y[i,:], which is index-dependent, would lead me to exclude a list comprehension-like solution.

Question then is: is there a more efficient way to code such a search exploiting numpy vectorization?

@Divakar `X` is thought to be a synthetic data set. Ideally I would have it with shape `(1000,8)`, even more rows possibly — matte, May 24 '19 at 15:09

Divakar · Accepted Answer · 2019-05-24T17:22:32.130

Approach #1

We could leverage row-views to get the pairwise matches. Then, run the loop and assign those in Y. The idea is to minimize the work once we start running loop. Considering that there could be more than one index matching with other indices, a purely vectorized method would be hard to propose. The implementation would look something like this -

# https://stackoverflow.com/a/44999009/ @Divakar
def view1D(a): # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

# Get 1D view
a1D = view1D(a)

# Perform broadcasting to get outer equality match
mask = a1D[:,None]==a1D

# Get indices of pairwise matches
n = len(mask)
mask[np.tri(n, dtype=bool)] = 0
idx = np.argwhere(mask)

# Run loop to assign equal rows in Y
for (i,j) in zip(idx[:,0],idx[:,1]):
    Y[j] = Y[i]

Alternative #1 : Use mask to directly assign

So, with mask, directly assign the rows in Y, like so -

for i,m in enumerate(mask):
    if m.any():
        Y[m] = Y[i]

This would be helpful if there are many matches.

Alternative #2 : Use reduced mask

If there are more than one row common between two rows, we might want to reduce those to have all of those linked to the first occurring ones. Hence, we can generate a reduced-mask and use that instead of the earlier mask -

mask0 = np.zeros_like(mask)
mask0[mask.argmax(0), np.arange(len(mask))] = 1
np.fill_diagonal(mask0,0)

Then, use mask0 instead of mask and assign.

Approach #2

Another method would be starting off with the 1D view method and using sorting-based method to setup the pairwise matching indices, like so -

sidx = a1D.argsort() # a1D from earlier approach
b = a1D[sidx]
m0 = b[:-1] == b[1:]
m1 = np.r_[False,m0,False]
idx = np.flatnonzero(m1[:-1]!=m1[1:]).reshape(-1,2)
for (i,j) in idx:
    row0,row1 = sidx[i],sidx[i+1:j+1]
    Y[row1] = Y[row0]

It is a very interesting approach! Does it perhaps resemble the matrix-as-array-with-offset representation, like how commonly done in C code optimization? — matte, May 25 '19 at 07:33
@MatteoZambra Not aware of `matrix-as-array-with-offset representation`. Could you elaborate on it? — Divakar, May 25 '19 at 07:34
Maybe I am misunderstanding the underlying logic of your answer, but I saw some example of C code optimization where two dimensional arrays, say a `int mat[N][M]` is actually treated like a one dimensional array `int mat[N + M]`, indexing positions as `mat[i*M + j]` instead of `mat[i][j]`. As far as I could understand, it is made for the sake of speeding ip the fetching process, that is saving two dimensional arrays in a linear fashion, consistent with the linear feature of the RAM. There is [this]("https://stackoverflow.com/a/14015582/9136498") post and references therein that further deepen — matte, May 25 '19 at 07:44
@MatteoZambra Nah, that's not happening here. Say `row2 = [4,5,8]` and `row1 = 2`, then with `Y[row1] = Y[row0]`, first off it broadcasts `Y[row0]` to three copies and then assigns - `Y[4] = Y[2]`, `Y[5] = Y[2]`, `Y[8] = Y[2]` all in one go. Under the hoods, it doesn't need to make those copies of `Y[2]` though, so it's a lot efficient. Hope that makes sense. — Divakar, May 25 '19 at 08:12

score 1 · Answer 2 · answered May 24 '19 at 13:41

See the following example: As an illustration, consider a 1-dimensional vector of True and False for which you want to count the number of “False to True” transitions in the sequence:

np.random.seed(444)
x = np.random.choice([False, True], size=100000)

With a Python for-loop, one way to do this would be to evaluate, in pairs, the truth value of each element in the sequence along with the element that comes right after it:

def count_transitions(x) -> int:
  count = 0
  for i, j in zip(x[:-1], x[1:]):
     if j and not i:
        count += 1
  return count

count_transitions(x)

In vectorized form, there’s no explicit for-loop or direct reference to the individual elements:

np.count_nonzero(x[:-1] < x[1:])

How do these two equivalent functions compare in terms of performance? In this particular case, the vectorized NumPy call wins out by a factor of about 70 times

https://realpython.com/numpy-array-programming/

thank you for the answer, it is a clever trick. But that answers the question _how many_ rows equals each other, the main concern however here is _which ones_ equate? I mean: In the reported example, as well as in the page you linked, I could not find such a vectorization hack — matte, May 24 '19 at 15:12

score 0 · Answer 3 · answered May 24 '19 at 17:34

0

I'm on my phone so I can't test this, but I think it will work

mask = np.all(X[:, None] == X[None], axis=-1)
ind1, ind2 = np.nonzero(mask)
ind1, ind2 = ind1[ind1 < ind2], ind2[ind1 < ind2]
Y[ind2] = Y[ind1]

answered May 24 '19 at 17:34

Subhaneil Lahiri

408
3
10

Thank you for your answer. I do not get the `axis = -1` clause. I sought on the web but I could not find further info about. What does that axis mean? – matte May 25 '19 at 07:50
It means that `np.all` checks if all elements of the array are true across the *last* axis, for each index of the other axes. In this case it would be the same as `axis=2`. You can also use a tuple of ints to check across multiple axes. By default `np.all` checks across every axis. – Subhaneil Lahiri May 25 '19 at 18:17

Hybridizing Numpy vectorization and plain Python

3 Answers3

Approach #1

Alternative #1 : Use mask to directly assign

Alternative #2 : Use reduced mask

Approach #2