12

I have two numpy arrays of the same length that contain binary values:

import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

I want to compute the Hamming distance between them as fast as possible, since I have millions of such distance computations to make.

A simple but slow option is this (taken from Wikipedia):

%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop

I have come up with faster options, inspired by some answers here on Stack Overflow.

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop

I'm wondering if there are even faster ways to compute this, possibly using Cython?

benbo
  • Are the lengths of the example arrays `a` and `b` the same as the lengths of your real data? – Warren Weckesser Sep 23 '15 at 03:11
  • Are you calculating all pairwise distances within an array of arrays, or between two arrays of arrays? You might be able to use [`scipy.spatial.distance.cdist`](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.cdist.html) or [`scipy.spatial.distance.pdist`](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.pdist.html); a sketch follows these comments. – user2034412 Sep 23 '15 at 03:13
  • @WarrenWeckesser they are of the same order, yes. They will be between the length of 20 and 100 depending on some parameter settings. – benbo Sep 23 '15 at 03:14
  • scipy/spatial/distance.py hamming(u, v): ... `return (u != v).mean()` . See also [bitarray](https://github.com/ilanschnell/bitarray). – denis Jan 26 '16 at 10:35
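
A minimal sketch of the cdist/pdist idea from the comment above, assuming (hypothetically) that the millions of comparisons come from two stacks of such binary vectors:

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.randint(0, 2, (1000, 21))  # 1000 binary vectors of length 21
Y = np.random.randint(0, 2, (1000, 21))

# cdist's 'hamming' metric returns the fraction of differing positions,
# so scale by the vector length to get counts
D = cdist(X, Y, metric='hamming') * X.shape[1]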

5 Answers

24

There is a ready-made numpy function that beats len((a != b).nonzero()[0]) ;)

np.count_nonzero(a!=b)
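
For the sample arrays from the question this returns the expected count:

import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

np.count_nonzero(a != b)  # 7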
yevgeniy
8

Compared to 1.07µs for np.count_nonzero(a!=b) on my platform, gmpy2.hamdist gets it down to about 143ns after conversion of each array to an mpz (multiple-precision integer):

import numpy as np
from gmpy2 import mpz, hamdist, pack

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

Based on a tip from @casevh, conversion from a 1D array of ones and zeros to a gmpy2 mpz object can be done reasonably efficiently with gmpy2.pack(list(reversed(list(array))),1).

# gmpy2.pack reverses bit order but that does not affect
# hamdist since both its arguments are reversed
ampz = pack(list(a),1) # takes about 4.29µs
bmpz = pack(list(b),1)

hamdist(ampz,bmpz)
Out[8]: 7

%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop

For relative comparison, on my platform:

%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop

%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop   
  • To be fair, you should probably include the time it takes to convert the input arrays to mpz format. – Warren Weckesser Sep 23 '15 at 11:48
  • You can use `gmpy2.pack(list(a),1)` to convert the numpy array to an mpz. It is faster than `convert2mpz()`. If you include the conversion time, it will still be slower than the numpy solutions. – casevh Sep 23 '15 at 12:55
  • @WarrenWeckesser: I thought about that and sort of agree. What bothers me is that numpy data is obviously in optimal format for a numpy solution while most hamming distance algorithms in C, which take numeric input of some kind, operate at the bit level. It seems to me that seriousness about hamming distance computations performing well implies not using an array to represent a sequence of bits, since that's only one number. The purpose of my answer is to provide a data point for just hamming distance performance with a reasonably well C coded Python module. –  Sep 23 '15 at 16:41
  • @casevh: Thanks for the tip. Found it necessary to use gmpy2.pack(list(reversed(list(a))),1) which takes about 5.47µs on my platform. –  Sep 23 '15 at 17:27
  • You do need to use `reversed()` if you want to construct the same mpz as your original code. However, the Hamming distance doesn't depend on the order of the bits (i.e. high-to-low versus low-to-high). As long as the two arrays have the same length so that the same bit positions are compared against each other, the Hamming distance will be the same. – casevh Sep 24 '15 at 05:34
  • gmpy2 is THE way to go. The time has reduced from 1.5 us per hash to 0.21 us per hash (211 ns). This is a 7x speedup, only 5.44s on a 27m list. Calculated on an array of 16x16 boolean arrays VS python list of gmpy2 mpz's built from 256 length binary. – Bart Apr 07 '16 at 18:03
  • Anyone know if something's changed since this was posted? When I try to use pack after copy-pasting the imports and definitions of a & b in this post, I get the error: TypeError: pack() requires list elements be positive integers < 2^n bits – Daniel Crane Apr 04 '17 at 05:06
  • I also have same problem as @DanielCrane – afruzan Jul 31 '19 at 07:09
  • For me it worked changing `list(a)` to `a.tolist()` – Carlo Bono Apr 22 '20 at 17:07
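
Putting the comments together, a minimal sketch that also runs on newer gmpy2 releases (using a.tolist(), as suggested above, in place of list(a)):

import numpy as np
from gmpy2 import pack, hamdist

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# tolist() yields plain Python ints, which newer gmpy2 versions require
ampz = pack(a.tolist(), 1)
bmpz = pack(b.tolist(), 1)

hamdist(ampz, bmpz)  # 7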
6

Using Pythran can bring an extra benefit here:

$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
    return len(nonzero(a != b)[0])

As a reference (without Pythran):

$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop

While after Pythran compilation:

$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop

That's roughly a 6x speedup over the numpy implementation, as Pythran skips the creation of an intermediate array when evaluating the element-wise comparison.

I also measured:

#pythran export hamm(int[], int[])
from numpy import count_nonzero
def hamm(a,b):
    return count_nonzero(a != b)

And I get 3.11 usec per loop for the Python version and 0.427 usec per loop with the Pythran one.

Disclaimer: I'm one of the Pythran devs.

serge-sans-paille
0

For strings, this works faster:

def Hamm(a, b):
    c = 0
    for i in range(len(a)):  # len() works for strings as well as arrays
        if a[i] != b[i]:
            c += 1
    return c
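
For example:

Hamm("10110", "10011")  # 2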
Micra
0

I suggest you convert the numpy bit array into a numpy uint8 array using np.packbits.
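
A sketch of that conversion (equal-length inputs get identical zero padding from np.packbits, so the padding cancels in the XOR-based distance below):

import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0], dtype=np.uint8)
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1], dtype=np.uint8)

a8 = np.packbits(a)  # 21 bits packed into 3 zero-padded uint8 bytes
b8 = np.packbits(b)

With the hamming_dist snippet below, hamming_dist(a8, b8) then returns 7.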

Have a look at scipy's spatial.distance.hamming: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html
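
Note that scipy's hamming returns the fraction of mismatching positions (essentially (u != v).mean(), as a comment under the question points out), so multiply by the length to get a count:

from scipy.spatial.distance import hamming

u = [1, 0, 1, 1]
v = [1, 1, 0, 1]

hamming(u, v)                # 0.5, the fraction of differing positions
int(hamming(u, v) * len(u))  # 2, the Hamming distance as a count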

Otherwise, here is a small snippet that needs only numpy, inspired by Fast way of counting non-zero bits in positive integer:

bit_counts = np.array([bin(x).count("1") for x in range(256)], dtype=np.uint8)

def hamming_dist(a, b, axis=None):
    return np.sum(bit_counts[np.bitwise_xor(a, b)], axis=axis)

With axis=-1, this allows taking the Hamming distance between one entry and a large array; e.g.:

inp = np.uint8(np.random.random((512, 8)) * 255)  # 512 entries of 8 bytes each
hd = hamming_dist(inp, inp[123], axis=-1)  # 512 Hamming distances to entry 123
idx_best = np.argmin(hd)     # should point to entry 123 itself
hd[123] = 255                # mask out the identity
idx_nearest = np.argmin(hd)  # entry with the shortest distance to entry 123
dist_hist = np.bincount(np.uint8(hd))  # distribution of Hamming distances; for me this ranged from 18 to 44 bits with a maximum at 31
Oliver Zendel