I have two arrays:

>>> import numpy as np
>>> a=np.array([2, 1, 3, 3, 3])
>>> b=np.array([1, 2, 3, 3, 3])

What is the fastest way of comparing these two arrays for equality of elements, regardless of the order?

EDIT: I measured the execution times of the following functions:

def compare1():        # works only for arrays without duplicate elements
    a=np.array([1,2,3,5,4])
    b=np.array([2,1,3,4,5])
    temp=0
    for i in a:
        temp+=len(np.where(b==i)[0])
    if temp==5:
        val=True
    else:
        val=False
    return 0

def compare2():
    a=np.array([1,2,3,3,3])
    b=np.array([2,1,3,3,3])
    val=np.all(np.sort(a)==np.sort(b))
    return 0

def compare3():                        #thx to ODiogoSilva
    a=np.array([1,2,3,3,3])
    b=np.array([2,1,3,3,3])
    val=set(a)==set(b)
    return 0

import numpy.lib.arraysetops as aso
def compare4():                        #thx to tom10
    a=np.array([1,2,3,3,3])
    b=np.array([2,1,3,3,3])
    val=len(aso.setdiff1d(a,b))==0
    return 0

The results are:

>>> import timeit
>>> timeit.timeit(compare1,number=1000)
0.0166780948638916
>>> timeit.timeit(compare2,number=1000)
0.016178131103515625
>>> timeit.timeit(compare3,number=1000)
0.008063077926635742
>>> timeit.timeit(compare4,number=1000)
0.03257489204406738

It seems that the set-based method by ODiogoSilva is the fastest.

Do you know other methods that I can test as well?

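EDIT 2: The runtimes above were not a good measure for comparing these methods, as explained in a comment by user2357112: the arrays were tiny and the array creation was included in the timings. Here is a revised test with larger arrays that are created outside the timed functions: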

#test.py
import numpy as np
import numpy.lib.arraysetops as aso

#without duplicates
N=10000
a=np.arange(N,0,step=-2)
b=np.arange(N,0,step=-2)

def compare1():
    temp=0
    for i in a:
        temp+=len(np.where(b==i)[0])
    if temp==len(a):
        val=True
    else:
        val=False
    return val
def compare2():
    val=np.all(np.sort(a)==np.sort(b))
    return val
def compare3():
    val=set(a)==set(b)
    return val
def compare4():
    val=len(aso.setdiff1d(a,b))==0
    return val

The output is:

>>> from test import *
>>> import timeit
>>> timeit.timeit(compare1,number=1000)
101.16708397865295
>>> timeit.timeit(compare2,number=1000)
0.09285593032836914
>>> timeit.timeit(compare3,number=1000)
1.425955057144165
>>> timeit.timeit(compare4,number=1000)
0.44780397415161133

Now compare2 is the fastest. Is there still a method that could outgun this?
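
One further candidate that could be added to the benchmark is a counting approach. This is only a sketch under a loud assumption: both arrays are 1-D and hold non-negative integers of moderate size, because `np.bincount` allocates one counter per possible value. The function name `same_multiset_bincount` is made up for illustration. Whether it beats compare2 depends on the array length and value range, so it would need to be timed.

```python
import numpy as np

def same_multiset_bincount(a, b):
    """Order-insensitive comparison that also respects duplicate counts.

    Assumption: a and b are 1-D arrays of non-negative integers.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        return False
    if a.size == 0:
        return True
    # Use a common length so the two count vectors are directly comparable.
    size = int(max(a.max(), b.max())) + 1
    counts_a = np.bincount(a, minlength=size)
    counts_b = np.bincount(b, minlength=size)
    return bool(np.array_equal(counts_a, counts_b))

N = 10000
a = np.arange(N, 0, step=-2)
b = a[::-1]  # same values in reverse order
print(same_multiset_bincount(a, b))
```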

Andy
  • You just want to know if they have the same elements? In this case 1,2,3? – ODiogoSilva Apr 26 '15 at 00:42
  • Sort both, then just compare I would guess. – Baum mit Augen Apr 26 '15 at 00:44
  • @ODiogoSilva yes, my first try is just to see if these arrays contain 1,2,3 – Andy Apr 26 '15 at 00:45
  • Try timing on bigger arrays, and don't include the array creation time in the timings. Right now, some of your tests are mostly measuring per-call overhead, and some of your tests aren't reflecting drastic slowdowns that occur with larger arrays. – user2357112 Apr 26 '15 at 02:38
  • Also, both of the answers you've received will consider `[1, 2, 2]` equivalent to `[1, 1, 2]`. Is that what you want? It doesn't look like it. I would recommend going with your `compare2`. – user2357112 Apr 26 '15 at 02:41
  • thx, I improved that to reflect the real slowdown for large arrays. Well, actually I have arrays with no duplicates, not like I stated in the very top. – Andy Apr 26 '15 at 11:55
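
The point about duplicates raised in the comments can be checked with a small example. This is an illustrative sketch using the `[1, 2, 2]` and `[1, 1, 2]` arrays from the comment. The variable names are made up here.

```python
import numpy as np

a = np.array([1, 2, 2])
b = np.array([1, 1, 2])

# A set comparison discards multiplicity, so the two arrays look equal.
same_distinct_values = set(a.tolist()) == set(b.tolist())

# Sorting keeps duplicates, so the differing counts are detected.
same_with_counts = np.array_equal(np.sort(a), np.sort(b))

print(same_distinct_values)  # True
print(same_with_counts)      # False
```

If the arrays are known to contain no duplicates and have the same length, both checks agree.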

2 Answers

NumPy has a collection of set operations.

import numpy as np
import numpy.lib.arraysetops as aso

a=np.array([2, 1, 3, 3, 3])
b=np.array([1, 2, 3, 3, 3])

print(aso.setdiff1d(a, b))  # prints [] because every value in a also occurs in b
tom10
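
A caveat worth illustrating: `setdiff1d(a, b)` only reports values of `a` that are missing from `b`, so an empty result does not by itself show that the arrays hold the same values. The sketch below uses made-up example arrays and the top-level `np.setdiff1d` and `np.setxor1d` functions. Like every set operation, it still ignores how often a value occurs.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([1, 2, 3])

only_in_a = np.setdiff1d(a, b)      # values of a that are missing from b
only_in_either = np.setxor1d(a, b)  # values present in exactly one of the arrays

print(only_in_a.size == 0)       # True, although b contains an extra 3
print(only_in_either.size == 0)  # False, the extra 3 is reported
```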

To see if both arrays contain the same distinct elements, in this case [1,2,3], you could do:

import numpy as np
a=np.array([2, 1, 3, 3, 3])
b=np.array([1, 2, 3, 3, 3])

set(a) == set(b)
# True
ODiogoSilva
  • I think sets remove duplicates. – Nick Bartlett Apr 26 '15 at 00:49
  • Yes, they do, though the OP only wanted to see if the arrays contained 1,2,3 – ODiogoSilva Apr 26 '15 at 00:50
  • If OP really wants the fastest way to do this, it's probably best to stay within `numpy` and use the tools it provides, as these will probably be fastest for large numpy arrays. That said, if OP really wants the fastest way, it's kind of up to them to come up with meaningful test cases. – Marius Apr 26 '15 at 00:53
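
Following the suggestion to stay within `numpy`, the same "distinct values only" comparison can be sketched with `np.unique` instead of Python sets. The helper name `same_distinct_values` is hypothetical, and no speed claim is made here; it would have to be timed on realistic arrays.

```python
import numpy as np

def same_distinct_values(a, b):
    """True if a and b contain the same distinct values, ignoring order and counts."""
    # np.unique returns the sorted distinct values of each array.
    return bool(np.array_equal(np.unique(a), np.unique(b)))

a = np.array([2, 1, 3, 3, 3])
b = np.array([1, 2, 3, 3, 3])
print(same_distinct_values(a, b))  # True
```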