While researching this, I found that checking whether two lists are disjoint can be done in O(n + m) expected time, where n and m are the lengths of the lists (see here). The idea is that insertion and lookup of elements run in constant time on average for hash maps. Therefore, inserting all elements of the first list into a hash map takes O(n) operations, and checking for each element of the second list whether it is already in the hash map takes O(m) operations. Solutions based on sorting, which run in O(n log(n) + m log(m)), are therefore not asymptotically optimal.
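To make the hash-based idea concrete, here is a minimal pure-Python sketch for two flat lists. The function name `first_common_element` is only an illustration and not part of any library; Python's built-in `set` plays the role of the hash map.

```python
def first_common_element(xs, ys):
    """Return the first element of ys that also occurs in xs, or None."""
    seen = set(xs)       # n insertions, O(n) expected time
    for y in ys:         # at most m lookups, O(m) expected time
        if y in seen:
            return y
    return None


print(first_common_element([3, 1, 4], [9, 4, 7]))  # 4
print(first_common_element([3, 1, 4], [9, 8, 7]))  # None
```

The two lists are disjoint exactly when the function returns `None` (assuming `None` itself is not an element of the lists).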
Though the solutions by @Divakar are highly efficient in many use cases, they are less efficient if the second dimension is large. In that case, a solution based on hash maps is better suited. I have implemented it as follows in Cython:
# distutils: language = c++
import numpy as np
cimport numpy as np
import cython
from libc.math cimport NAN
from libcpp.unordered_map cimport unordered_map
np.import_array()
@cython.boundscheck(False)
@cython.wraparound(False)
def get_common_element2d(np.ndarray[double, ndim=2] arr1,
                         np.ndarray[double, ndim=2] arr2):
    cdef np.ndarray[double, ndim=1] result = np.empty(arr1.shape[0])
    cdef int dim1 = arr1.shape[1]
    cdef int dim2 = arr2.shape[1]
    cdef int i, j
    cdef unordered_map[double, int] tmpset = unordered_map[double, int]()

    for i in range(arr1.shape[0]):
        for j in range(dim1):
            # insert arr1[i, j] as key without assigned value
            tmpset[arr1[i, j]]

        for j in range(dim2):
            # check whether arr2[i, j] is in tmpset
            if tmpset.count(arr2[i, j]):
                result[i] = arr2[i, j]
                break
        else:
            # no break occurred: row i has no common element
            result[i] = NAN

        tmpset.clear()

    return result
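For checking the compiled function on small inputs, a pure-Python reference can be handy. The following is only a sketch: the name `get_common_element2d_py` is made up for illustration, and it returns a plain list rather than a NumPy array. It accepts any sequence of rows, including 2D NumPy arrays.

```python
import math


def get_common_element2d_py(rows1, rows2):
    """For each pair of rows, return the first value of the second row
    that also occurs in the first row, or NaN if there is none."""
    result = []
    for row1, row2 in zip(rows1, rows2):
        seen = set(row1)
        # next() with a default plays the role of the for/else above
        result.append(next((x for x in row2 if x in seen), math.nan))
    return result


out = get_common_element2d_py([[1.0, 2.0], [3.0, 4.0]],
                              [[5.0, 2.0], [6.0, 7.0]])
print(out)  # [2.0, nan]
```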
I have created test cases as follows:
import numpy as np
import timeit
from itertools import starmap
from mycythonmodule import get_common_element2d
m, n = 3000, 3000
a = np.random.rand(m, n)
b = np.random.rand(m, n)
for i, row in enumerate(a):
    if np.random.randint(2):
        common = np.random.choice(row, 1)
        b[i][np.random.choice(np.arange(n), np.random.randint(min(n, 20)), False)] = common
# we need to copy the arrays on each test run, otherwise they
# will remain sorted, which would bias the results
%timeit [set(aa).intersection(bb) for aa, bb in zip(a.copy(), b.copy())]
# returns 3.11 s ± 56.8 ms
%timeit list(starmap(np.intersect1d, zip(a.copy(), b.copy())))
# returns 1.83 s ± 55.4
# test sorting method
# divakarsMethod1 is approach #1 in @Divakar's answer
%timeit divakarsMethod1(a.copy(), b.copy())
# returns 1.88 s ± 18 ms
# test hash map method
%timeit get_common_element2d(a.copy(), b.copy())
# returns 1.46 s ± 22.6 ms
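The `%timeit` magic is only available in IPython/Jupyter. Outside of it, a comparable measurement could be sketched with the standard `timeit` module as below. The helper `time_call` is hypothetical, and unlike the runs above it builds the fresh copies outside the timed region, so its numbers would not be directly comparable to the ones reported here.

```python
import timeit


def time_call(func, make_args, repeat=3):
    """Return the best wall-clock time in seconds of func(*make_args())."""
    timings = []
    for _ in range(repeat):
        args = make_args()  # fresh inputs, created outside the timed region
        start = timeit.default_timer()
        func(*args)
        timings.append(timeit.default_timer() - start)
    return min(timings)


# Small self-contained demo: time sorting a fresh list on every run
best = time_call(sorted, lambda: (list(range(1000, 0, -1)),))
print(f"best of 3: {best:.6f} s")
```

With the arrays from the test setup, a call might look like `time_call(get_common_element2d, lambda: (a.copy(), b.copy()))`.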
These results seem to indicate that the naive approach is actually better than some vectorized versions. However, the vectorized algorithms play to their strengths when many rows with fewer columns are considered (a different use case). In these cases, the vectorized approaches are more than 5 times faster than the naive approach, and the sorting method turns out to be best.
Conclusion: I will go with the hash-map-based Cython version, because it is among the most efficient variants in both use cases. If I had to set up Cython first, I would use the sorting-based method instead.