
I am working on a program in Python that, right now, goes row by row through a 2D numpy array looking for an identical row in a second array. If it finds a duplicate, it runs a small bit of code using that row's index in the first array.

This works fine and is efficient enough when the arrays are small (~2x500 and 2x500), but it quickly becomes inefficient for longer arrays. I am wondering if someone knows of a method using numpy (I am currently using other numpy features elsewhere, so it would be best not to have to change data types), or perhaps something else, that would be more efficient. I am sure there is something faster than two for loops over the arrays. Thanks in advance.

import random
import numpy as np

N = 1000
speed = 50
longueur = 20000
largeur = 30000
quadrillage = 50
p = 0.8

def stick():
    u = random.random()
    if u < p:
        a = 1   # The particle is stuck
    else:
        a = 0   # The particle did not stick; it will bounce instead
    return a

obstacle_number = 2000
maxstuck = 4
numbstuck = np.zeros(obstacle_number)

spacinglarg = largeur / quadrillage
spacinglong = longueur / quadrillage
obs0 = np.random.randint(0, spacinglarg, (obstacle_number, 1)) * quadrillage
obs1 = np.random.randint(0, spacinglong, (obstacle_number, 1)) * quadrillage
obs = np.concatenate([obs0, obs1], axis=1)

s = (N, 2)
A = np.zeros(s)
for i in range(0, N):
    a = i * longueur / N
    b = 50
    A[i, 0] = b
    A[i, 1] = a


T = quadrillage * np.round(A / quadrillage)   # snap positions to the obstacle grid

B = np.zeros(s)
tp = 2 * np.pi
for i in range(0, N):
    aa = random.randint(0, 360) * tp / 360   # random direction in radians
    B[i, 0] = np.cos(aa) * speed
    B[i, 1] = np.sin(aa) * speed


for i in range(0, N):
    for j in range(0, len(obs)):
        if T[i, 0] == obs[j, 0] and T[i, 1] == obs[j, 1]:
            if numbstuck[j] <= maxstuck and abs(B[i, 0]) != 0:
                sss = stick()
                if sss == 1:            # it sticks
                    B[i, 0] = 0
                    B[i, 1] = 0
                    numbstuck[j] += 1
                else:                   # it bounces back
                    B[i, 0] = -B[i, 0]
                    B[i, 1] = -B[i, 1]
Kyle
  • Please provide an [MCVE]. If I paste your code into my editor as-is, it will raise name errors. – Mad Physicist May 27 '21 at 21:02
  • Noted thanks, I will edit the question. – Kyle May 27 '21 at 21:10
  • Please ping me when you do. It's likely that your program can be completely vectorized, meaning a speedup of 10-100x – Mad Physicist May 27 '21 at 21:11
  • @MadPhysicist I updated the code. It runs for me. I left out some parts to be concise but the functionality may be a little unclear now. The two for loops are actually in a function that is called repetitively (up to 500 times). The top bit of code is only run once and initializes the simulations. Cheers Kyle – Kyle May 27 '21 at 21:42
  • Thanks. I will play with it and likely make it go fast – Mad Physicist May 27 '21 at 21:43
  • You can replace the entire call to `stick()` with the expression `np.random.random() < p`. Booleans are integers in Python with value 0 or 1. – Mad Physicist May 27 '21 at 22:08
  • Just to clarify, you're trying to find all the elements of `T` that match an element of `obs`, allow at most the first `maxstuck` of those to stick to an obstacle with 80% probability? – Mad Physicist May 27 '21 at 22:54
  • First problem, `abs(B[i,0]) != 0` will never be true: you start with an array of zeros. Similarly, negating `B` does nothing since it's always zero. Once you've clarified that bit of logic, you can reduce all this to a couple of masked array assignments, possibly a call to `searchsorted`. Certainly no need for all this `O(M * N)` looping. – Mad Physicist May 27 '21 at 22:56
  • Yes that's correct, and thanks already for the first suggestion. – Kyle May 27 '21 at 22:56
  • Actually, the B I put in the code above is just a placeholder so that it runs. I tried to leave out unnecessary code to keep it readable, but neglected to mention that. It does have all non-zero values in the beginning. I updated the code to show the real values of B. I will look into the masked array assignments, thanks. – Kyle May 27 '21 at 23:00
  • I was going to suggest `np.ones(s)`, but that works too. – Mad Physicist May 27 '21 at 23:03
  • One more question. Are your locations always integers? Or are you interested in using floats? – Mad Physicist May 27 '21 at 23:50
  • This is probably your best bet for a starting point: https://stackoverflow.com/q/38674027/2988730. However, I can tailor for your specific needs once I understand the dtype better. Floats and integers have different methods of packing. – Mad Physicist May 28 '21 at 00:05
  • The locations are always integers. Right now the dtype is floats, but that could be changed. Other note: The array 'obs' does not change during a given run, only 'A', and thus 'T' as well change each run. – Kyle May 28 '21 at 07:30
  • Does this answer your question? [Find the row indexes of several values in a numpy array](https://stackoverflow.com/questions/38674027/find-the-row-indexes-of-several-values-in-a-numpy-array) – Brian Tompsett - 汤莱恩 May 28 '21 at 10:19
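
Pulling the comment thread's suggestions together (a vectorized sticking draw instead of `stick()`, and masked assignments instead of the double loop), a minimal sketch might look like the following. The toy `T`, `obs`, and `B` arrays merely stand in for the question's data, and the `maxstuck` bookkeeping is deliberately left out:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8  # sticking probability, as in the question

# Toy stand-ins for the question's arrays: grid-snapped particle
# positions T, obstacle positions obs, and particle velocities B.
obs = np.array([[0, 0], [50, 100], [100, 50], [150, 150]])
T = np.array([[50, 100], [0, 0], [999, 999], [100, 50], [50, 100], [7, 7]])
B = np.ones((len(T), 2))

# match[i, j] is True when particle i sits exactly on obstacle j
match = (T[:, None, :] == obs[None, :, :]).all(axis=-1)
hit_i, hit_j = np.nonzero(match)

# One Bernoulli(p) draw per hit replaces the per-particle stick() call
stuck = rng.random(len(hit_i)) < p
B[hit_i[stuck]] = 0.0    # stuck particles stop
B[hit_i[~stuck]] *= -1   # the rest bounce straight back
```

This ignores the `maxstuck` cap; enforcing it would additionally need a per-obstacle counter, e.g. updated with `np.add.at(numbstuck, hit_j[stuck], 1)`.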

1 Answer


The general idea is to first write down the code in a straightforward fashion:

import numpy as np

size = 42
a = np.arange(size**2).reshape(size, size)
b = a.copy() + size * 5

def detect_duplicates(a, b):
    duplicates = []
    for i, row_a in enumerate(a):
        for j, row_b in enumerate(b):
            if np.all(row_a == row_b):
                duplicates.append((i, j))
    return duplicates

This is quite slow, though:

In [1]: %timeit detect_duplicates(a, b)
7.1 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But this can be sped up considerably using numba without changing a single line of code in the loop:

import numpy as np
import numba

size = 42
a = np.arange(size**2).reshape(size, size)
b = a.copy() + size * 5

@numba.njit
def detect_duplicates(a, b):
    duplicates = []
    for i, row_a in enumerate(a):
        for j, row_b in enumerate(b):
            if np.all(row_a == row_b):
                duplicates.append((i, j))
    return duplicates

It is now much faster:

In [1]: %timeit detect_duplicates(a, b)
235 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
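
If numba is not an option, one dependency-free alternative (a sketch, not part of the original answer) is to hash each row's raw bytes, turning the O(M·N) double loop into an O(M+N) dictionary lookup. This assumes both arrays share the same dtype and that the rows of `b` are unique:

```python
import numpy as np

size = 42
a = np.arange(size**2).reshape(size, size)
b = a.copy() + size * 5

def detect_duplicates_hashed(a, b):
    # Map each row of `b` (keyed by its raw bytes) to its index, then look
    # up every row of `a` in that table: O(M + N) instead of O(M * N).
    # Byte-wise comparison only makes sense when both dtypes match.
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    index_in_b = {row.tobytes(): j for j, row in enumerate(b)}
    return [(i, index_in_b[row.tobytes()])
            for i, row in enumerate(a)
            if row.tobytes() in index_in_b]
```

With this toy data, each row of `b` equals a row of `a` shifted five places, so the result pairs row `i` of `a` with row `i - 5` of `b`.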
Jan Christoph Terasa