3

I have two sorted, numpy arrays similar to these ones:

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])

Elements never repeat in the same array. I want to figure out a way of pythonicaly figuring out a list of indexes that contain the locations in the arrays at which the same element exists.

For instance, 1 exists in x and y at index 0. Element 2 in x doesn't exist in y, so I don't care about that item. However, 8 does exist in both arrays - in index 2 in x but index 1 in y. Similarly, 15 exists in both, in index 4 in x, but index 2 in y. So the outcome of my function would be a list that in this case returns [[0, 0], [2, 1], [4, 2]].

So far what I'm doing is:

def get_indexes(x, y):
    indexes = []
    for i in range(len(x)):
        # Find index where item x[i] is in y:
        j = np.where(x[i] == y)[0]

        # If it exists, save it:
        if len(j) != 0:
            indexes.append([i, j[0]])

    return indexes

But the problem is that arrays x and y are very large (millions of items), so it takes quite a while. Is there a better pythonic way of doing this?

Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
Néstor
  • 585
  • 8
  • 20
  • Does this answer your question? ['in' for two sorted lists with the lowest complexity](https://stackoverflow.com/questions/65789112/in-for-two-sorted-lists-with-the-lowest-complexity) – Tomerikoo Mar 24 '21 at 13:06
  • Hi @Tomerikoo, thanks for the pointer! I think it is different enough as I'm interested in the _indexes_, not only if they exist on both or not. That's an added complexity of this question I believe? – Néstor Mar 24 '21 at 13:27
  • @Tomerikoo--seems all answers in that link use explicit Python loops which would be slower than a couple of answers here which avoid that. – DarrylG Mar 24 '21 at 13:30
  • Well the OP of that question actually said something about indexes which everyone elegantly decided to ignore ^_^. Indeed it is not the same. Will remove my vote to close, will leave the comment as they are related – Tomerikoo Mar 24 '21 at 13:37

5 Answers5

2

Without Python loops

Code

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between two arrays
    overlap = np.intersect1d(x, y)
    
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    
    # Result is the zip two 1d numpy arrays into 2d array
    return np.dstack((loc1, loc2))[0]

Usage

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
result = get_indexes_darrylg(x, y)

# result[0]: array([[0, 0],
                    [2, 1],
                    [4, 2]], dtype=int64)

Timing Posted Solutions

Results show that darrlg code has the fastest run time.

enter image description here

Code Adjustment

  • Each posted solution as a function.
  • Slight mod so that each solution outputs an numpy array.
  • Curve named after poster

Code

import numpy as np
import perfplot

def create_arr(n):
    ' Creates pair of 1d numpy arrays with half the elements equal '
    max_val = 100000     # One more than largest value in output arrays
    
    arr1 = np.random.randint(0, max_val, (n,))
    arr2 = arr1.copy()
    
    # Change half the elements in arr2
    all_indexes = np.arange(0, n, dtype=int)
    indexes = np.random.choice(all_indexes, size = n//2, replace = False) # locations to make changes
    
    
    np.put(arr2, indexes, np.random.randint(0, max_val, (n//2, )))        # assign new random values at change locations
   
    arr1 = np.sort(arr1)
    arr2 = np.sort(arr2)
    
    return (arr1, arr2)

def get_indexes_lllrnr101(x,y):
    ' lllrnr101 answer '
    ans = []
    i=0
    j=0
    while (i<len(x) and j<len(y)):
        if x[i] == y[j]:
            ans.append([i,j])
            i += 1
            j += 1
        elif (x[i]<y[j]):
            i += 1
        else:
            j += 1
    return np.array(ans)

def get_indexes_joostblack(x, y):
    'joostblack'
    indexes = []
    for idx,val in enumerate(x):
        idy = np.searchsorted(y,val)
        try:
            if y[idy]==val:
                indexes.append([idx,idy])
        except IndexError:
            continue  # ignore index errors
            
    return np.array(indexes)

def get_indexes_mustafa(x, y):
    indices_in_x = np.flatnonzero(np.isin(x, y))                 # array([0, 2, 4])
    indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))   # array([0, 1, 2]
    
    return np.array(list(zip(indices_in_x, indices_in_y)))

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between two arrays
    overlap = np.intersect1d(x, y)
    
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    
    # Result is the zip two 1d numpy arrays into 2d array
    return np.dstack((loc1, loc2))[0]

def get_indexes_akopcz(x, y):
    ' akopcz answer '
    return np.array([
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ])

perfplot.show(
    setup = create_arr,  # tuple of two 1D random arrays
    kernels=[
        lambda a: get_indexes_lllrnr101(*a),
        lambda a: get_indexes_joostblack(*a),
        lambda a: get_indexes_mustafa(*a),
        lambda a: get_indexes_darrylg(*a),
        lambda a: get_indexes_akopcz(*a),
    ],
    labels=["lllrnr101", "joostblack", "mustafa", "darrylg", "akopcz"],
    n_range=[2 ** k for k in range(5, 21)],
    xlabel="Array Length",
    # More optional arguments with their default values:
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    equality_check=None, #np.allclose,  # set to None to disable "correctness" assertion
    # show_progress=True,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
DarrylG
  • 16,732
  • 2
  • 17
  • 23
1

What you are doing is O(nlogn) which is decent enough.
If you want, you can do it in O(n) by iterating on both arrays with two pointers and since they are sorted, increase the pointer for the array with smaller object.

See below:

x = [1, 2, 8, 11, 15]
y = [1, 8, 15, 17, 20, 21]

def get_indexes(x,y):
    ans = []
    i=0
    j=0
    while (i<len(x) and j<len(y)):
        if x[i] == y[j]:
            ans.append([i,j])
            i += 1
            j += 1
        elif (x[i]<y[j]):
            i += 1
        else:
            j += 1
    return ans

print(get_indexes(x,y))

which gives me:

[[0, 0], [2, 1], [4, 2]]
lllrnr101
  • 2,288
  • 2
  • 4
  • 15
1

Although, this function will search for all the occurances of x[i] in the y array, if duplicates are not allowed in y it will find x[i] exactly once.

def get_indexes(x, y):
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]
akocz
  • 138
  • 6
0

You can use numpy.searchsorted:

def get_indexes(x, y):
    indexes = []
    for idx,val in enumerate(x):
        idy = np.searchsorted(y,val)
        if y[idy]==val:
            indexes.append([idx,idy])
    return indexes
joostblack
  • 2,465
  • 5
  • 14
0

One solution is to first look from x's side to see what values are included in y by getting their indices through np.isin and np.flatnonzero, and then use the same procedure from the other side; but instead of giving x entirely, we give only the (already found) intersected elements to gain time:

indices_in_x = np.flatnonzero(np.isin(x, y))                 # array([0, 2, 4])
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))   # array([0, 1, 2])

Now you can zip them to get the result:

result = list(zip(indices_in_x, indices_in_y))               # [(0, 0), (2, 1), (4, 2)]
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38