
I have two arrays, arr1 and arr2, with shapes (90000, 1) and (120000, 1). I'd like to find out whether any element along axis=0 of arr1 is present in arr2, write the matching positions to a list, and later remove them. This ensures that no element of either array can be found in the other. For now, I'm using nested for loops:

list_conflict = []
for i in range(len(arr1)):
    for j in range(len(arr2)):
        if arr1[i] == arr2[j]:
            list_conflict.append([i, j])

fault_index_pos = np.unique([x[0] for x in list_conflict])
fault_index_neg = np.unique([x[1] for x in list_conflict])

arr2 = np.delete(arr2, fault_index_neg, axis=0)
arr1 = np.delete(arr1, fault_index_pos, axis=0)

It takes an element of arr1 in the outer loop and compares it with every element of arr2 exhaustively. If it finds a match, it appends the index pair to list_conflict, the first element being the arr1 position and the second the arr2 position. Then fault_index_pos and fault_index_neg are reduced to unique elements, since an element of arr1 could occur in multiple places of arr2 and the list will contain repeated positions. Finally, the matching elements are removed with np.delete, taking the fault_index lists as the indices to delete.

I'm looking for a faster approach to the conflict comparison, whether multiprocessing, vectorization, or anything else. You could say it won't take much time, but the arrays are actually of shape (x, 8, 10); I shortened them for the sake of clarity.

martineau
colt.exe

4 Answers


Ignoring the numpy part, finding the conflicting index pairs can be done much faster in pure Python, in time proportional to len(a) + len(b) + the number of conflicts, rather than the nested loops, which take time proportional to the product of the two lengths:

def conflicts(a, b):
    from collections import defaultdict
    # map each element of a to the list of all its positions
    elt2ix = defaultdict(list)
    for i, elt in enumerate(a):
        elt2ix[elt].append(i)
    # for each element of b, yield every (index in a, index in b) match
    for j, elt in enumerate(b):
        if elt in elt2ix:
            for i in elt2ix[elt]:
                yield i, j

Then, e.g.,

for pair in conflicts([1, 2, 4, 5, 2], [2, 3, 8, 4]):
    print(pair)

displays

(1, 0)
(4, 0)
(2, 3)

which are the indices of the matching occurrences of 2 and 4.
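For illustration, a sketch of feeding these pairs back into np.delete as the question does (this wiring is my addition, not part of the answer):

```python
import numpy as np
from collections import defaultdict

def conflicts(a, b):
    # map each element of a to the list of all its positions
    elt2ix = defaultdict(list)
    for i, elt in enumerate(a):
        elt2ix[elt].append(i)
    # for each element of b, yield every (index in a, index in b) match
    for j, elt in enumerate(b):
        if elt in elt2ix:
            for i in elt2ix[elt]:
                yield i, j

arr1 = np.array([1, 2, 4, 5, 2])
arr2 = np.array([2, 3, 8, 4])
pairs = list(conflicts(arr1.tolist(), arr2.tolist()))
fault_index_pos = np.unique([p[0] for p in pairs])
fault_index_neg = np.unique([p[1] for p in pairs])
arr1 = np.delete(arr1, fault_index_pos, axis=0)  # -> [1, 5]
arr2 = np.delete(arr2, fault_index_neg, axis=0)  # -> [3, 8]
```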

Tim Peters
  • How about creating reverse dictionaries from *both* inputs (maybe call them `elt2ix_a` and `elt2ix_b`) and then do: `for elt in set(elt2ix_a) & set(elt2ix_b): yield from itertools.product(elt2ix_a[elt], elt2ix_b[elt])` – alani Jul 13 '20 at 21:00
  • Would work, takes more memory, haven't timed either. Note that `set(dict1) & set(dict2)` can also be written `dict1.keys() & dict2.keys()`. Haven't timed that either ;-) – Tim Peters Jul 13 '20 at 21:08
  • Thanks, I didn't know you could do that with `dict_keys`. – alani Jul 13 '20 at 21:10
  • Python 3's "dict views" act like sets in several respects. Very brief overview here: https://stackoverflow.com/questions/47273297/python-understanding-dictionary-view-objects – Tim Peters Jul 13 '20 at 21:15
  • What if a and b are lists of lists instead of lists of integers? I get unhashable type: list error for `elt2ix[elt].append(i)` – colt.exe Jul 13 '20 at 21:58
  • There are many possible ways to proceed then, but - of course - they're more involved. Too involved to fit in a Stackoverflow comment box. Simplest would be to convert them to tuples. "But what if they're lists of lists _of_ lists?" is the next question then ;-) – Tim Peters Jul 13 '20 at 22:08
  • @TimPeters converting list of lists in to a lists of tuples makes this work for me. Thanks for the efforts. – colt.exe Jul 14 '20 at 12:11
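Following up the comment thread, a quick sketch of the tuple conversion Tim Peters suggests, so each inner list becomes hashable (the sample data is mine):

```python
from collections import defaultdict

def conflicts(a, b):
    elt2ix = defaultdict(list)
    for i, elt in enumerate(a):
        elt2ix[elt].append(i)
    for j, elt in enumerate(b):
        if elt in elt2ix:
            for i in elt2ix[elt]:
                yield i, j

# lists of lists are unhashable, so convert each inner list to a tuple first
a = [[1, 2], [3, 4], [1, 2]]
b = [[3, 4], [5, 6], [1, 2]]
pairs = list(conflicts([tuple(row) for row in a], [tuple(row) for row in b]))
# pairs == [(1, 0), (0, 2), (2, 2)]
```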
  • I would use pandas, which sits on numpy, to create a Boolean mask
  • Takes 4 lines of code
import numpy as np
import pandas as pd

# create test data
np.random.seed(1)
a = np.random.randint(10, size=(10, 1))
np.random.seed(1)
b = np.random.randint(8, 15, size=(10, 1))

# create dataframe
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)

# find unique values in df_a
unique_a = df_a[0].unique().tolist()

# create a Boolean mask and return only values of df_b not found in df_a
values_not_in_a = df_b[~df_b[0].isin(unique_a)].to_numpy()

Values

a = array([[5],
           [8],
           [9],
           [5],
           [0],
           [0],
           [1],
           [7],
           [6],
           [9]])

b = array([[13],
           [11],
           [12],
           [ 8],
           [ 9],
           [11],
           [13],
           [ 8],
           [ 8],
           [ 9]])

# final output array
values_not_in_a = array([[13],
                         [11],
                         [12],
                         [11],
                         [13]])

Only using numpy

import numpy as np

# create test data
np.random.seed(1)
a = np.random.randint(10, size=(10, 1))
np.random.seed(1)
b = np.random.randint(8, 15, size=(10, 1))

ua = np.unique(a)  # unique values of a
ub = np.unique(b)  # unique values of b

mask_b = np.isin(b, ua, invert=True)
mask_a = np.isin(a, ub, invert=True)

b_values_not_in_a = b[mask_b]
a_values_not_in_b = a[mask_a]

# b_values_not_in_a
array([13, 11, 12, 11, 13])

# a_values_not_in_b
array([5, 5, 0, 0, 1, 7, 6])

timeit

# using the following arrays
np.random.seed(1)
a = np.random.randint(10, size=(90000, 1))
np.random.seed(1)
b = np.random.randint(8, 15, size=(120000, 1))

%%timeit
5.6 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
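The question mentions the real arrays are (x, 8, 10); np.isin compares element-wise, so whole blocks need a different treatment. One possible sketch (my addition; `remove_shared_blocks` is a hypothetical helper, not from this answer) treats each block's bytes as a hashable key:

```python
import numpy as np

def remove_shared_blocks(a, b):
    """Remove from a and b every trailing-shape block that appears in both.

    Assumes a and b share the same dtype and trailing shape, so equal
    blocks have identical byte representations.
    """
    a2 = a.reshape(len(a), -1)
    b2 = b.reshape(len(b), -1)
    # blocks present in both arrays, keyed by their raw bytes
    shared = {row.tobytes() for row in a2} & {row.tobytes() for row in b2}
    mask_a = np.fromiter((row.tobytes() not in shared for row in a2), bool, len(a2))
    mask_b = np.fromiter((row.tobytes() not in shared for row in b2), bool, len(b2))
    return a[mask_a], b[mask_b]

# small demo: b shares exactly one (2, 3) block with a
a = np.arange(24).reshape(4, 2, 3)
b = np.concatenate([a[1:2], np.full((1, 2, 3), 99)])
ra, rb = remove_shared_blocks(a, b)
# ra has 3 blocks left, rb has 1
```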
Trenton McKinney
  • @0x5453 That's true, as I say in the first bullet of my post. However, using pandas, I don't need to transpose the shape of the array to create the mask. – Trenton McKinney Jul 13 '20 at 21:01
  • @colt.exe this changes the scope of the original question. (90000, 80) is 90k rows and 80 columns. Now it's not clear what you're trying to accomplish. Are you now trying to remove all the values from a 1d array from each column of the 2d array? – Trenton McKinney Jul 13 '20 at 21:46
  • @colt.exe However, if I understand correctly, if you have a 1d array whose values you want removed from each column of the 2d array, you can do as I've shown, but use a loop to apply the mask to each column of the 2d array. However, I don't think you can combine them back into an array, because the column lengths must be the same. – Trenton McKinney Jul 13 '20 at 21:57

Please work through some tutorials on the vector capabilities of NumPy, as well as the sequence inclusion operators of Python. You are trying to program a large-scale application that sorely needs language facilities you haven't yet learned.

That said, perhaps the fastest way to do this is to convert each to a set and take the set intersection. The involved operations are O(N) for a sequence/set of N elements; your nested loop is O(N*M) in the two sequence sizes.

Any tutorial on Python sets will walk you through this.

Prune
  • Well, I certainly thought of sets, but had an uneducated guess that there might be faster ways of achieving this. However, I do fully agree that I have to take some tutorials, and will definitely do so when I have spare time. Appreciate your suggestions. – colt.exe Jul 13 '20 at 20:47

As @Prune suggested, here's a solution that uses sets:

overlap = np.array(list(set(arr1.ravel()) & set(arr2.ravel())))  # flatten first so elements are hashable scalars
arr1 = arr1[~np.isin(arr1, overlap)]  # note: boolean indexing flattens multi-dimensional input
arr2 = arr2[~np.isin(arr2, overlap)]
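A quick usage sketch of this approach (sample data is mine) with flat arrays:

```python
import numpy as np

arr1 = np.array([1, 2, 4, 5, 2])
arr2 = np.array([2, 3, 8, 4])

# elements present in both arrays
overlap = np.array(list(set(arr1.tolist()) & set(arr2.tolist())))
arr1 = arr1[~np.isin(arr1, overlap)]  # -> [1, 5]
arr2 = arr2[~np.isin(arr2, overlap)]  # -> [3, 8]
```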
0x5453