I have two very large arrays `X_pos` and `X_neg` with shapes (900000, 100) and (1200000, 100). I'd like to find out whether any row of `X_pos` is present in `X_neg`. If there is a match, I need a list of indices for the matching rows, so that I can later remove them and leave the two arrays with no rows in common. For now, I'm using `for` loops:
```python
import numpy as np

list_conflict = []
for i in range(len(X_pos)):
    for j in range(len(X_neg)):
        # Compare full rows; a bare == would yield an ambiguous boolean array
        if (X_pos[i] == X_neg[j]).all():
            list_conflict.append([i, j])

# Deduplicate the row indices on each side before deleting
fault_index_pos = np.unique([x[0] for x in list_conflict])
fault_index_neg = np.unique([x[1] for x in list_conflict])
X_neg = np.delete(X_neg, fault_index_neg, axis=0)
X_pos = np.delete(X_pos, fault_index_pos, axis=0)
```
It takes a row of `X_pos` in the outer loop and compares it exhaustively with every row of `X_neg`. If it finds a match, it appends the index pair to `list_conflict`, the first element being the `X_pos` position and the second the `X_neg` position. Then `fault_index_pos` and `fault_index_neg` are reduced to their unique elements, since a row of `X_pos` could match several places in `X_neg` and the list will contain recurring positions. Finally, the matching rows are removed with `np.delete` by passing the `fault_index` lists as the indices to be deleted.
I'm looking for a faster approach to the conflict comparison, whether via a `set`, vectorization, or anything else.
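To illustrate what I mean by a set-based approach, here is a rough sketch I have in mind (the helper name `find_conflicts` is just mine, and it assumes both arrays share the same dtype so rows can be keyed by their raw bytes):

```python
import numpy as np

def find_conflicts(X_pos, X_neg):
    # Map each X_neg row, keyed by its raw bytes, to every position it occupies
    neg_index = {}
    for j, row in enumerate(X_neg):
        neg_index.setdefault(row.tobytes(), []).append(j)

    # Look up each X_pos row in the dictionary: O(1) on average per row
    # instead of a full scan of X_neg
    conflicts = []
    for i, row in enumerate(X_pos):
        for j in neg_index.get(row.tobytes(), ()):
            conflicts.append([i, j])
    return conflicts
```

The deletion step would stay the same, with `list_conflict = find_conflicts(X_pos, X_neg)` feeding the `np.unique`/`np.delete` calls above. But I'd welcome anything faster or more memory-friendly.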
P.S. Example arrays:

```python
X_pos = np.array([[0, 1, 2, 3, 4],
                  [0, 1, 2, 3, 5],
                  [0, 1, 2, 3, 6],
                  [2, 4, 5, 6, 7]])

X_neg = np.array([[2, 4, 5, 6, 7],
                  [0, 1, 2, 3, 7],
                  [0, 1, 2, 3, 4],
                  [0, 1, 2, 3, 9],
                  [1, 9, 3, 2, 5]])
```
It should return the indices of the rows that are duplicated in the other array, which is `list_conflict = [[0, 2], [3, 0]]` (row 0 of `X_pos` matches row 2 of `X_neg`, and row 3 of `X_pos` matches row 0 of `X_neg`). With these indices, I should then be able to delete the shared rows from both arrays. A possible duplicate is what I asked yesterday, but that question fails to explain the problem precisely because of unnecessary simplification; its answers are not a solution to this problem and would have to be altered in some way to work.
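For reference, running my hypothetical `find_conflicts` sketch from above on these example arrays should reproduce exactly that pair list:

```python
print(find_conflicts(X_pos, X_neg))  # expected: [[0, 2], [3, 0]]
```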