remove duplicate elements from two numpy arrays

Question

I have two numpy arrays a and b, with twenty million elements (float number). If the combination elements of those two arrays are the same, then we call it duplicate, which should be remove from the two arrays. For instance,

a = numpy.array([1,3,6,3,7,8,3,2,9,10,14,6])
b = numpy.array([2,4,15,4,7,9,2,2,0,11,4,15])

From those two arrays, we have a[2]&b[2] is the same as a[11]&b[11], then we call it duplicate element, which should be removed. The same as a[1]&b[1] vs a[3]&b[3]Although each array has duplicate elements itself, they are not treated as duplicate elements. So I want the returned arrays to be:

a = numpy.array([1,3,6,7,8,3,2,9,10,14])
b = numpy.array([2,4,15,7,9,2,2,0,11,4])

Anyone has the cleverest way to implement such reduction?

Ah, misread the question, `np.vstack({tuple(row) for row in np.column_stack((a,b))})` should work, may not give your first unique element order. — Zero, Mar 26 '16 at 06:44
I don't know how costly it is, but can you make them as one array like `np.vstack((a,b)).T` then use [this answer](http://stackoverflow.com/a/16973510/3349443) to find unique rows? — Mahdi, Mar 26 '16 at 07:07

B. M. · Accepted Answer · 2016-03-26T16:35:42.027

3

First you have to pack a and b to identify duplicates. If values are positive integers (see the edit in other cases), this can be achieved by :

base=a.max()+1
c=a+base*b

Then just find unique values in c:

val,ind=np.unique(c,return_index=True)

and retrieve the associated values in a and b.

ind.sort()
print(a[ind])
print(b[ind])

for the disparition of the duplicate. (two here):

[ 1  3  6  7  8  3  2  9 10 14]
[ 2  4 15  7  9  2  2  0 11  4]

EDIT

regardless of datatype, the c array can be made as follow, packing data to bytes :

ab=ascontiguousarray(vstack((a,b)).T) 
dtype = 'S'+str(2*a.itemsize)
c=ab.view(dtype=dtype)

edited Mar 26 '16 at 16:35

answered Mar 26 '16 at 15:50

B. M.

18,243
2
35
54

Thank you. The array values are positive, but float number. – Huanian Zhang Mar 26 '16 at 15:52
I have provided a more general approach. – B. M. Mar 26 '16 at 16:22
Thank you very much. It works. Your code is concise, but not that easy to get it~~~ – Huanian Zhang Mar 26 '16 at 16:51
If I want to extend to 3 arrays, the only thing I need to change is the np.vstack, right? I need to ab = np.vstack((a,b)), abc = np.ascontiguousarray(np.vstack((ab,c)).T), then the rest is the same, right? – Huanian Zhang Mar 27 '16 at 03:42

Reti43 · Answer 2 · 2016-03-26T10:49:44.020

This is done in one pass and without requiring any extra memory for the resulting arrays.

Pair up the elements at each index and iterate over them. Keep a track of which pairs have been seen so far and a counter of the index of the arrays. When a new pair has not been seen before, the index will increase by 1, effectively writing them back to their original place. However, for a duplicate pair you don't increase the index, effectively shifting every new pair one position to the left. At the end, keep the first indexth number of elements to shorten the arrays.

import itertools as it

def delete_duplicate_pairs(*arrays):
    unique = set()
    arrays = list(arrays)
    n = range(len(arrays))
    index = 0
    for pair in it.izip(*arrays):
        if pair not in unique:
            unique.add(pair)
            for i in n:
                arrays[i][index] = pair[i]
            index += 1
    return [a[:index] for a in arrays]

If you are on Python 2, zip() creates the list of pairs up front. If you have a lot of elements in your arrays, it'll be more efficient to use itertools.izip() which will create the pairs as you request them. However, zip() in Python 3 behaves like that by default.

For your case,

>>> import numpy as np
>>> a = np.array([1,3,6,3,7,8,3,2,9,10,14,6])
>>> b = np.array([2,4,15,4,7,9,2,2,0,11,4,15])
>>> a, b = delete_duplicate_pairs(a, b)
>>> a
array([ 1,  3,  6,  7,  8,  3,  2,  9, 10, 14])
>>> b
array([ 2,  4, 15,  7,  9,  2,  2,  0, 11,  4])

Now, it all comes down to what values your arrays hold. If you have only the values 0-9, there are only 100 unique pairs and most elements will be duplicates, which saves you time. For 20 million elements for both a and b and containing values only between 0-9, the process completes in 6 seconds. For values between 0-999, it takes 12 seconds.

Thank you very much, I will try it. But I forgot to mention, the two arrays I have contain float number spanning between 0 - 50. — Huanian Zhang, Mar 26 '16 at 15:52

remove duplicate elements from two numpy arrays

2 Answers2