Fastest way to identify all pairs of lists that their difference is lower than a given threshold when the overall list is very long (10000)

Question

ahi, everyone. sorry to bother you.

I have this task that I have a list of hash codings stored in a list with 30 positions with value 0 and 1. In total, I have over 10000 such 30 size (0/1) hash codes and I would like to find all pairs of such hash codes which have the difference lower than a given threshold (say 0, 1, 5), in which case this pair would be considered as "similar" hash codings.

I have realised this using nested "for loop" in python3 (see code below), but I do not feel it is efficient enough, as this seems to be a O(N^2), and it is indeed slow when N = 10000 or even larger.

My question would be is there better way we could speed this finding similar hash pairs up ? Ideally, in O(N) I suppose ?

Note by efficiency I mean finding similar pairs given thershold rather than generating hash codings (this is only for demonstration).

I have digged in this problem a little bit, all the answers I have found is talking about using some sort of collection tools to find identical pairs, but here I have a more general case that the pairs could also be similiar given a threshold.

I have provided the code that generates sample hashing codings and the current low efficient program I am using. I hope you may find this problem interesting and hopefully some better/smarter/senior programmer could lend me a hand on this one. Thanks in advance.

import random
import numpy as np

# HashCodingSize = 10
# Just use this to test the program
HashCodingSize = 100
# HashCodingSize = 1000
# What can we do when we have the list over 10000, 100000 size ? 
# This is where the problem is 
# HashCodingSize = 10000
# HashCodingSize = 100000

#Generating "HashCodingSize" of list with each element has size of 30
outputCodingAllPy = []
for seed in range(HashCodingSize):
    random.seed(seed)
    listLength = 30
    numZero = random.randint(1, listLength)
    numOne = listLength - numZero
    my_list = [0] * numZero + [1] * numOne
    random.shuffle(my_list)
    # print(my_list)
    outputCodingAllPy.append(my_list)

#Covert to np array which is better than python3 list I suppose?
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))

hashDiffThreshold = 0
#hashDiffThreshold = 1
#hashDiffThreshold = 5
loopRange = range(outputCodingAll.shape[0])
samePairList = []

#This is O(n^2) I suppose, is there better way ? 
for i in loopRange:
    for j in loopRange:
        if j > i:
            if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
                print("The pair (",  str(i), ", ", str(j), ") ")
                samePairList.append([i, j])

print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(samePairList)

Update3 Please refer to accepted answer for quick solution or for more info read the answer provided by me down below in the answer section not in question section

Update2 RAM problem when list size goes up to 100000, the current speed solution still has the problem of RAM (numpy.core._exceptions._ArrayMemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type int64). In this case, anyone who are interested in the speed but without large RAM may consider parallel programming the original method **

Update with current answers and benchmarking tests:

I have briefly tested the answer provided by @Raibek, and it is indeed much faster than the for loop and has incoporated most of suggestions provided by others (many thanks to them as well). For now my problem is resolved, for anyone who are further interested in this problem, you could refer to @Raibek in accepted answer or to see my own test program below:

Hint: For people who are absolutely in short of time on their project, what you need to do is to take function "bits_to_int" and "find_pairs_by_threshold_fast" to home, and first convert 0/1 bits to integers, and using XOR to find all the pairs that smaller than a threshold. Hope this helps faster.

from logging import raiseExceptions
import random
import numpy as np
#check elapsed time
import time


# HashCodingSize = 10
# HashCodingSize = 100
HashCodingSize = 1000
# What can we do when we have the list over 10000, 100000 size ? 
# HashCodingSize = 10000
# HashCodingSize = 100000

#Generating "HashCodingSize" of list with each element has 30 size
outputCodingAllPy = []
for seed in range(HashCodingSize):
    random.seed(seed)
    listLength = 30
    numZero = random.randint(1, listLength)
    numOne = listLength - numZero
    my_list = [0] * numZero + [1] * numOne
    random.shuffle(my_list)
    # print(my_list)
    outputCodingAllPy.append(my_list)

#Covert to np array which is better than python3 list
#Study how to convert bytes to integers 
outputCodingAll = np.asarray(outputCodingAllPy)
print(outputCodingAll)
print("The N is", len(outputCodingAll))

hashDiffThreshold = 0
def myWay():
    loopRange = range(outputCodingAll.shape[0])
    samePairList = []

    #This is O(n!) I suppose, is there better way ? 
    for i in loopRange:
        for j in loopRange:
            if j > i:
                if (sum(abs(outputCodingAll[i,] - outputCodingAll[j,])) <= hashDiffThreshold):
                    print("The pair (",  str(i), ", ", str(j), ") ")
                    samePairList.append([i, j])
    return(np.array(samePairList))

#Thanks to Raibek
def bits_to_int(bits: np.ndarray) -> np.ndarray:
    """
    https://stackoverflow.com/a/59273656/11040577
    :param bits:
    :return:
    """
    assert len(bits.shape) == 2
    # number of columns is needed, not bits.size
    m, n = bits.shape
    # -1 reverses array of powers of 2 of same length as bits
    a = 2**np.arange(n)[::-1]
    # this matmult is the key line of code
    return bits @ a

#Thanks to Raibek
def find_pairs_by_threshold_fast(
        coding_all_bits: np.ndarray,
        listLength=30,
        hashDiffThreshold=0
) -> np.ndarray:

    xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)

    # counting number of differences
    diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
    for i in range(1, listLength):
        diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)

    same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

    # filtering out diagonal values
    same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]

    # filtering out duplicates above diagonal
    same_pairs.sort(axis=1)
    same_pairs = np.unique(same_pairs, axis=0)

    return same_pairs


start = time.time()
outResult1 = myWay()
print("My way")
print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
print(outResult1)
end = time.time()
timeUsedOld = end - start
print(timeUsedOld)


start = time.time()
print('Helper Way updated')
print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
outputCodingAll_bits = bits_to_int(outputCodingAll)
same_pairs_fast = find_pairs_by_threshold_fast(outputCodingAll_bits, 30, hashDiffThreshold)
print(same_pairs_fast)
end = time.time()
timeUsedNew = end - start
print(timeUsedNew)

print(type(outResult1))
print(type(same_pairs_fast))

if ((outResult1 == same_pairs_fast).all()) & (timeUsedNew < timeUsedOld):
    print("The two methods have returned the same results, I have been outsmarted !")
    print("The faster method used ", timeUsedNew, " while the old method takes ", timeUsedOld)
else:
    raiseExceptions("Error, two methods do not return the same results, something must be wrong")


#Thanks to Raibek
#note this suffers from out of memoery problem
# def Helper1Way():
    # outer_not_equal = np.not_equal.outer(outputCodingAll, outputCodingAll)

    # diff_count_matrix = outer_not_equal.sum((1, 3)) // outputCodingAll.shape[1]

    # samePairNumpy = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

    # # filtering out diagonal values
    # samePairNumpy = samePairNumpy[samePairNumpy[:, 0] != samePairNumpy[:, 1]]

    # # filtering out duplicates above diagonal
    # samePairNumpy.sort(axis=1)
    # samePairNumpy = np.unique(samePairNumpy, axis=0)
    # return(np.array(samePairNumpy))

# start = time.time()
# outResult2 = Helper1Way()
# print('Helper Way')
# print("Following pairs are considered the same given the threshold ", hashDiffThreshold)
# print(outResult2)
# end = time.time()
# print(end - start)

i think it wont be solved in 0(N), BUT WHAT YOU CAN DO IS , sort the array and then check the pairs having differences under the threshold value, if i crosses thrersold than remove it. worst case scenario it will be O(N*N) — sahasrara62, Oct 27 '22 at 05:20
First, specifying O(n) is a bit silly because a list of n codes can yield (n^2 - n) / 2 pairs. Without restrictions on the input, no algorithm can be O(n). Second, @sahasrara62 is right, but given (if I'm reading your explanation correctly) the codes are a fixed number of bits, you can sort in O(n) time with radix sort. Third, stop using lists and make each code a single `int`. Even so, a Python radix sort might be slow despite that it's O(n). Sorting 10k 30-bit ints will be a few millis with Python's built-in sort. If you need faster, switch languages. — Gene, Oct 27 '22 at 05:41

craigb · Answer 1 · 2022-10-27T06:02:39.673

If you only need 30-bit vectors, it would be much better to represent then as 30 bits in a 32-bit integer. Then the Hamming distance between two "vectors" is just the number of bits in the xor of the two integers. There are efficient algorithms for computing the number of non-zero bits in an integer. Those can be readily vectorized using numpy.

So the algorithm is:

generate HashCodingSize random integers between 0 and (1<<30)-1. That's one line with numpy.random.randint()
for each value xor it with the array (see numpy.bitwise_xor), compute the number of bits in each xor output value (vectorize one of the bit count algorithms), and find the indices whose bit count is less than or equal to hashDiffThreshold

This is still O(n^2), but is just a single loop in python; each operation in the loop operates on a length-n vector with numpy calls.

Nick · Answer 2 · 2022-10-27T08:32:08.010

As long as your listLength is within the size of an integer on your computer, I would use integers instead. Then you can xor the values (using broadcasting to xor all values against each other at once) to get the number of bits that are different, sum those bits and then use nonzero to find indexes that fit the requirement hash difference requirement. For example:

import numpy as np
import random

HashCodingSize = 10
listLength = 30
outputCodingAll = np.array([random.choice(range(2**listLength)) for _ in range(HashCodingSize)])
# sample result
# array([995834408, 173548139, 717311089,  87822983, 813938401, 
#        363814224, 970707528, 907497995, 337492435, 361696322])

distance = bit_count(outputCodingAll[:, np.newaxis] ^ outputCodingAll)
# sample result
# array([[ 0, 10, 15, 18, 14, 18,  8, 12, 18, 16],
#        [10,  0, 13, 14, 16, 24, 14, 14, 16, 18],
#        [15, 13,  0, 23, 13, 15, 15, 17, 19, 15],
#        [18, 14, 23,  0, 18, 16, 18, 12, 12, 14],
#        [14, 16, 13, 18,  0, 16, 12, 14, 14, 14],
#        [18, 24, 15, 16, 16,  0, 14, 16, 12,  6],
#        [ 8, 14, 15, 18, 12, 14,  0, 12, 18, 14],
#        [12, 14, 17, 12, 14, 16, 12,  0, 14, 14],
#        [18, 16, 19, 12, 14, 12, 18, 14,  0, 12],
#        [16, 18, 15, 14, 14,  6, 14, 14, 12,  0]], dtype=int32)

hashDiffThreshold = 10
samePairList = np.transpose(np.nonzero(distance < hashDiffThreshold))
# sample result
# array([[0, 0],
#        [0, 6],
#        [1, 1],
#        [2, 2],
#        [3, 3],
#        [4, 4],
#        [5, 5],
#        [5, 9],
#        [6, 0],
#        [6, 6],
#        [7, 7],
#        [8, 8],
#        [9, 5],
#        [9, 9]], dtype=int64)

Note the result repeats pairs (e.g. [5, 9] and [9, 5]) as they are all tested as both the first and second operand). It also includes each value tested against itself (which is obviously 0). These results can be easily filtered out if desired.

Note if you want to convert any of the values to lists of 1 and 0 you can format the numbers as binary strings of length listLength and map each character to an int e.g.

list(map(int, f'{outputCodingAll[0]:0{listLength}b}'))
# sample output
# [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

This code uses the bit_count function from this answer:

def bit_count(arr):
    # Make the values type-agnostic (as long as it's integers)
    t = arr.dtype.type
    mask = t(-1)
    s55 = t(0x5555555555555555 & mask)  # Add more digits for 128bit support
    s33 = t(0x3333333333333333 & mask)
    s0F = t(0x0F0F0F0F0F0F0F0F & mask)
    s01 = t(0x0101010101010101 & mask)
    
    arr = arr - ((arr >> 1) & s55)
    arr = (arr & s33) + ((arr >> 2) & s33)
    arr = (arr + (arr >> 4)) & s0F
    return (arr * s01) >> (8 * (arr.itemsize - 1))

Raibek · Accepted Answer · 2022-11-16T05:00:18.953

This version utilizes bitwise operations on integers. The method of converting numpy binary represantations to ints is gotten from this answer https://stackoverflow.com/a/59273656/11040577.

Bench results show that the new method is much faster than the original one:

N = 1000, 0.194 secs VS 3.332 secs
N = 10000, 17.417 secs VS 338.628 secs

import random
import numpy as np
from time import perf_counter


def generate_codings(
        HashCodingSize=100,
        listLength=30
) -> np.ndarray:

    # Generating "HashCodingSize" of list with each element has size of 30
    outputCodingAllPy = []
    for seed in range(HashCodingSize):
        random.seed(seed)
        numZero = random.randint(1, listLength)
        numOne = listLength - numZero
        my_list = [0] * numZero + [1] * numOne
        random.shuffle(my_list)
        # print(my_list)
        outputCodingAllPy.append(my_list)
    # Covert to np array which is better than python3 list I suppose?
    outputCodingAll = np.asarray(outputCodingAllPy)
    return outputCodingAll


def find_pairs_by_threshold(
        coding_all: np.ndarray,
        hashDiffThreshold=0
) -> np.ndarray:

    loopRange = range(coding_all.shape[0])
    samePairList = []

    #This is O(n!) I suppose, is there better way ?
    for i in loopRange:
        for j in loopRange:
            if j > i:
                if (sum(abs(coding_all[i,] - coding_all[j,])) <= hashDiffThreshold):
                    # print("The pair (",  str(i), ", ", str(j), ") ")
                    samePairList.append([i, j])

    return np.array(samePairList)


def bits_to_int(bits: np.ndarray) -> np.ndarray:
    """
    https://stackoverflow.com/a/59273656/11040577
    :param bits:
    :return:
    """
    assert len(bits.shape) == 2
    # number of columns is needed, not bits.size
    m, n = bits.shape
    # -1 reverses array of powers of 2 of same length as bits
    a = 2**np.arange(n)[::-1]
    # this matmult is the key line of code
    return bits @ a


def find_pairs_by_threshold_fast(
        coding_all_bits: np.ndarray,
        listLength=30,
        hashDiffThreshold=0
) -> np.ndarray:

    xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)

    # counting number of differences
    diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
    for i in range(1, listLength):
        diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)

    same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

    # filtering out diagonal values
    same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]

    # filtering out duplicates above diagonal
    same_pairs.sort(axis=1)
    same_pairs = np.unique(same_pairs, axis=0)

    return same_pairs


if __name__ == "__main__":

    list_length = 30
    hash_diff_threshold = 0

    for hash_coding_size in (100, 1000, 10000):

        # let's generate samples
        output_coding_all = generate_codings(hash_coding_size, list_length)
        print("The N is", len(output_coding_all))

        # find_pairs_by_threshold bench
        start_time = perf_counter()
        same_pairs_etalon = find_pairs_by_threshold(output_coding_all, hash_diff_threshold)
        end_time = perf_counter()
        print(f"find_pairs_by_threshold() took {end_time-start_time} secs...")
        print("Following pairs are considered the same given the threshold ", same_pairs_etalon)

        # find_pairs_by_threshold_fast bench
        # first, we should convert binary representations to int
        start_time = perf_counter()
        output_coding_all_bits = bits_to_int(output_coding_all)
        end_time = perf_counter()
        print(f"it took {end_time-start_time} secs to convert numpy array binary to ints...")

        start_time = perf_counter()
        same_pairs_fast = find_pairs_by_threshold_fast(output_coding_all_bits, list_length, hash_diff_threshold)
        end_time = perf_counter()
        print(f"find_pairs_by_threshold_fast() took {end_time-start_time} secs...")

        # check if the results are the same
        print(f"Two lists of pairs found by different methods are identical: {(same_pairs_fast == same_pairs_etalon).all()}")

The first, extremely memory-consuming version:

outer_not_equal = np.not_equal.outer(outputCodingAll, outputCodingAll)

diff_count_matrix = outer_not_equal.sum((1, 3)) // outputCodingAll.shape[1]

samePairNumpy = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

# filtering out diagonal values
samePairNumpy = samePairNumpy[samePairNumpy[:, 0] != samePairNumpy[:, 1]]

# filtering out duplicates above diagonal
samePairNumpy.sort(axis=1)
samePairNumpy = np.unique(samePairNumpy, axis=0)

Update on tackling memory shortage

This version iterates slices of 'slice_size' with concatenating the results of all iterations in the end.

For example, if 'numpy.core._exceptions._ArrayMemoryError' occurs on N=100,000 then you can play with 'slice_size=1000', 'slice_size=10000' or other slice sizes until it works best for you in your current environment.

def find_pairs_by_threshold_fast_v2(
        coding_all_bits: np.ndarray,
        listLength=30,
        hashDiffThreshold=0,
        slice_size=None
) -> np.ndarray:

    if slice_size is None:

        xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)

        # counting number of differences
        diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
        for i in range(1, listLength):
            diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2 ** i), i)

        same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

    else:

        same_pairs_list = []
    
        for slice_starts in range(0, len(coding_all_bits), slice_size):
    
            xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)
    
            # counting number of differences
            diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
            for i in range(1, listLength):
                diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)
    
            same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
    
            same_pairs[:, 0] += slice_starts
    
            same_pairs_list.append(same_pairs)
    
        same_pairs = np.concatenate(same_pairs_list)

    # filtering out diagonal values
    same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]

    # filtering out duplicates above diagonal
    same_pairs.sort(axis=1)
    same_pairs = np.unique(same_pairs, axis=0)

    return same_pairs

Edit:
Clarifying how number of differences is counted in 'diff_count_matrix' variable
The number of differences for each hash pair in 'xor_outer_matrix' is the number of '1' bits in binary representation.
In order to count the number of '1' bits in each int of 'xor_outer_matrix' we utilize bitwise operations as in the further example.

Let's say we have the number of 41 as a 8-bit int for the sake of simplicity.

The 8-bit binary represantation of 41 is 00101001.

Now, we can count the number of ones 'ones_count' this way:

ones_count = 0
(00101001) & (00000001) = 00000001, which is the binary represantation of 1.
So, ones_count = 0 + 1 = 1.
i = 1, 2**i = 2. The binary represantation of 2 is 00000010.
(00101001) & (00000010) = 00000000.
right_shift(00000000, i) = 00000000.
So, ones_count = 1 + 0 = 1.
i = 2, 2**2 = 4. The binary represantation of 4 is 00000100.
(00101001) & (00000100) = 00000000.
right_shift(00000000, i) = 00000000.
So, ones_count = 1 + 0 = 1.
i = 3, 2**3 = 8. The binary represantation of 8 is 00001000.
(00101001) & (00001000) = 00001000.
right_shift(00001000, i) = 00000001.
So, ones_count = 1 + 1 = 2.
i = 4, 2**4 = 16. The binary represantation of 16 is 00010000.
(00101001) & (00010000) = 00000000.
right_shift(00000000, i) = 00000000.
So, ones_count = 2 + 0 = 2.
i = 5, 2**5 = 32. The binary represantation of 32 is 00100000.
(00101001) & (00100000) = 00100000.
right_shift(00100000, i) = 00000001.
So, ones_count = 2 + 1 = 3.
i = 6, 2**6 = 64. The binary represantation of 64 is 01000000.
(00101001) & (01000000) = 00000000.
right_shift(00000000, i) = 00000000.
So, ones_count = 3 + 0 = 3.
i = 7, 2**7 = 128. The binary represantation of 128 is 10000000.
(00101001) & (10000000) = 00000000.
right_shift(00000000, i) = 00000000.
So, ones_count = 3 + 0 = 3.

So, finally we found that the number of ones in the binary representation of 41 is 3.

Many thanks for this solution, but when I make HashCodingSize = 10000, there is an out of memory error as : numpy.core._exceptions._ArrayMemoryError: Unable to allocate 83.8 GiB for an array with shape (10000, 30, 10000, 30) and data type bool. Any idea how can we fix this ? — CuteMeowMeow, Oct 28 '22 at 02:58
Yes, the solution turns out to be extremely memory-consuming:) Have you tried other options with bit representations mentioned here? If they don't work for you, I'd be happy to develop other ways out based on some ideas I have. — Raibek, Oct 28 '22 at 03:13
Many Thanks for reply. Yes, I have been trying to convert 30 bits into integers first and then try some sort of XOR tech (not clear what this is), and bench marking them. — CuteMeowMeow, Oct 28 '22 at 03:23
Well, so the problem is still on the market:) Good, I'm developing the ideas then and I will share asap. — Raibek, Oct 28 '22 at 05:26
Many thanks. This solution is indeed much faster and has incoporated most of things have been mentioned in this answer and resolved the issue of out of memory. I have accepted this answer. I have to spend some time to understand the program in more depth though. — CuteMeowMeow, Oct 29 '22 at 03:13
Hi, thanks for the answer, but if I increase the size to 100000, the solution still has problem of numpy.core._exceptions._ArrayMemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type int64 — CuteMeowMeow, Oct 29 '22 at 07:36
Updated with a "naive" approach. I'd recommend you to bench with various 'slice_size' in order to find the best for big N (e.g.100000 and beyond). — Raibek, Oct 29 '22 at 09:34
Thanks, could you explain a little bit more about diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1) for i in range(1, listLength): diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2 ** i), i) , as one may not be clear about why bitwise_and being used to compare xor_outer to number 1 ? — CuteMeowMeow, Nov 15 '22 at 04:11
Many thanks. If I understand this correctly, the code is simply count how many 1s in each XOR differences, right? I think an alternatively way may be convert the integer representation of XOR into binary representation, and then simply sum them up, but this would require Python to store a large matrix which may then put lot pressure on RAM, then I understand why you use binary property to count number of 1s, thanks. — CuteMeowMeow, Nov 21 '22 at 07:06

CuteMeowMeow · Answer 4 · 2022-12-28T05:40:03.217

I decide to finalise this question by answering it after I have exploited and implemented @Raibek 's great answer in my project. Also easier for bot like chatGPT for their future training (smiling)...

In short, In addition to Raibek's answer, I have written my own version of convert 10-base number to any base digits in both of single number or in vector or matrix to facilitate my understanding. It returns the same results as the function provided by Raibek. I also write an alternative version of Raibek's answer, though it returns the same result, it is much slower, so it is for the purpose of understanding the solution.

Additionally I wrote an alterantive answer, rather than count how many differences in 1s in two sequence of 30 bits, but to compare the absolute differences between the two numbers represented by two sequences of 30 bits. Though there is no clearly evidence why I need to do this, but consider following scenario, if first pair is 100001 and 000001, and the second pair is 000011 and 000001, both pair would seem to only have one different 1, but if you consider this as a binary representation, then the diference in first pair would be much larger than the second pair, given a threshold is presentend then it might not be reasonable to say both pairs can be considered as a same group. However, this can be arguable as no one tells us that this 30 bits hash code has to be a binary representation (i.e., it can be viewd just a normal sequence). Also when we set threshold = 0, both algorithem would return the same pairs (I have verified this). When we change the value of threshold, then the accepted answer returns pairs of sequence with number of different 1 lower than the threshold, while my provided answer would return pair of sequence whose represented value in binary lower than the threshold. What should be used in practice depends in context in this case, so I decide to provide the alternative algorithem here for future reference as well:

Raibek's answer (same as he has provided):

#Original method
def find_pairs_by_threshold_fast_v2(
        coding_all_bits: np.ndarray,
        listLength=30,
        hashDiffThreshold=0,
        slice_size=None
) -> np.ndarray:

    if slice_size is None:

        xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)

        # counting number of differences
        diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
        for i in range(1, listLength):
            diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2 ** i), i)

        same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))

    else:

        same_pairs_list = []
    
        for slice_starts in range(0, len(coding_all_bits), slice_size):
    
            xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)
    
            # counting number of differences
            diff_count_matrix = np.bitwise_and(xor_outer_matrix, 1)
            for i in range(1, listLength):
                diff_count_matrix += np.right_shift(np.bitwise_and(xor_outer_matrix, 2**i), i)
    
            same_pairs = np.transpose(np.where(diff_count_matrix <= hashDiffThreshold))
    
            same_pairs[:, 0] += slice_starts
    
            same_pairs_list.append(same_pairs)
    
        same_pairs = np.concatenate(same_pairs_list)

    # filtering out diagonal values
    same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]

    # filtering out duplicates above diagonal
    same_pairs.sort(axis=1)
    same_pairs = np.unique(same_pairs, axis=0)

    return same_pairs

Rather than counting on number of differences in 1s, we will use the integers that are represented by those 30 bits, i.e., alternative methods but also based on Rabek's answer;

def find_pairs_by_threshold_fast_v2_alt(
        coding_all_bits: np.ndarray,
        listLength=30,
        hashDiffThreshold=0,
        slice_size=None
) -> np.ndarray:

    if slice_size is None:
        #https://numpy.org/doc/stable/reference/generated/numpy.ufunc.outer.html
        #np.ufunc.outer means to run the function on all pairs of A and B
        #so below simply means compute the xor betweeen all paris of coding list 
        #just the same as what I have done using for i in range(lenA), for j in range(lenB) etc..
        #bitwise_xor returns the value represented by binary 
        #you could use binary_repr to represent value in binary instead (note for binary_repr it does not have .outer so you may not use pair-wise in this case)
        print("coding_all_bits is \n", coding_all_bits)
        # Directly calculate differences between two elements and return the absolute value 
        xor_outer_matrix = np.absolute(np.subtract.outer(coding_all_bits, coding_all_bits))
        # xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
        print("xor_outer_matrix is \n", xor_outer_matrix)

        same_pairs = np.transpose(np.where(xor_outer_matrix <= hashDiffThreshold))

    else:

        same_pairs_list = []
    
        for slice_starts in range(0, len(coding_all_bits), slice_size):
    
            # xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)
            xor_outer_matrix = np.absolute(np.subtract.outer(coding_all_bits, coding_all_bits))
    
            same_pairs = np.transpose(np.where(xor_outer_matrix <= hashDiffThreshold))
    
            same_pairs[:, 0] += slice_starts
    
            same_pairs_list.append(same_pairs)
    
        same_pairs = np.concatenate(same_pairs_list)

    # filtering out diagonal values
    same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]

    # filtering out duplicates above diagonal
    same_pairs.sort(axis=1)
    same_pairs = np.unique(same_pairs, axis=0)

    return same_pairs

Following are my exploit about convert integer into bits, or bits into integer, it is not decent or not even close, but may be useful to new programmers like me who is wish to get familar to bits representation etc...

The bits converting program provided by other stackoverflower:

def bits_to_int(bits: np.ndarray) -> np.ndarray:
    """
    https://stackoverflow.com/a/59273656/11040577
    :param bits:
    :return:
    """
    assert len(bits.shape) == 2
    # number of columns is needed, not bits.size
    m, n = bits.shape
    # -1 reverses array of powers of 2 of same length as bits
    a = 2**np.arange(n)[::-1]
    # this matmult is the key line of code
    return bits @ a

Following are my explore, start from converting single number to convert a matrix of numbers...

def ConvertIntToBits(IntValue, base):
    # When integer is 0 there is no way to convert it into bits
    if IntValue != 0:
        num_binaray = math.floor(math.log(IntValue, base) + 1)
        print("we need", num_binaray,"digits for value", IntValue, "on base", base)
        powerList = np.arange(num_binaray-1, -1, -1)
        # print(powerList)
        rawIntValue = IntValue
        bitResult = []
        # print(range(len(powerList)))
        for i in range(len(powerList)):
            bitsValue = math.floor(rawIntValue/(base**(powerList[i])))
            # print("powerList[i]:", powerList[i])
            # print("bitsValue:", bitsValue)
            rawIntValue = rawIntValue - bitsValue * (base**powerList[i])
            # print("rawIntValue:", rawIntValue)
            bitResult.append(bitsValue)
        # bitResult = bitResult
        # print(bitResult)
    elif IntValue == 0:
        bitResult = [0]
    return(bitResult)

# base2 = ConvertIntToBits(IntValue=125, base=2)
# base10 = ConvertIntToBits(IntValue=125, base=10)

# print("base10: ", base10)

# ConvertIntToBits(IntValue=96, base=2)
# ConvertIntToBits(IntValue=100, base=7)

#Next convert bits back to integer 
#note this does not accept the list of list
def ConvertBitsIntToInt(IntBits, base):
    num_binaray = len(IntBits)
    print("we have", num_binaray,"digits for bits", IntBits, "on base", base)
    powerList = np.arange(num_binaray-1, -1, -1)
    # print(powerList)
    IntValue = sum(IntBits * base**powerList)
    print(IntValue)
    return(IntValue)

# for testValue in [1, 100, 200, 60, 70, 8]:
#     for baseValue in [2, 3, 4, 5]:
#         IntBitsSammple = ConvertIntToBits(IntValue=testValue, base=baseValue)
#         ConvertBitsIntToInt(IntBitsSammple, base=baseValue)

#Think about what to do if np array has arrays which have different length of list
#When the list inside has different lengths, we could add 0 in front to make them have the same length
#this is becuase in different base system, 0 * base^n would still be 0 no matter what you do
def ConvertBitsListToIntList(IntBitsList, base):

    if isinstance(IntBitsList, (np.ndarray)):
        print("Our input are already np arrays")
        IntBitsArray = IntBitsList
    else:
        print("input is not np array, so we are converting")
        # paddling (i.e., part of number would have digits less than others, 
        # we paddling them by adding 0 in front of them without changing the original number)
        pad = len(max(IntBitsList, key=len))
        IntBitsListPad = np.array([[0]*(pad-len(i)) + i for i in IntBitsList])
        IntBitsArray = np.asarray(IntBitsListPad)
    
    print(IntBitsArray)
    shape_binaray = IntBitsArray.shape
    num_binaray = shape_binaray[1]
    length_binary = shape_binaray[0]
    print("we have", num_binaray, "digits for each bit and in total ", length_binary, " bits from", IntBitsArray, "on base", base)
    powerList = np.asarray([np.arange(num_binaray-1, -1, -1)] * length_binary)
    # print(powerList)
    IntValueList = np.sum(IntBitsArray * base**powerList, axis=1)
    #Convert np array back to list (it is better to convert it to list outside the function)
    IntValueList.tolist()
    # print(IntValueList)
    return(IntValueList)

def ConvertIntListToBitsList(IntList, base):
    if isinstance(IntList, (np.ndarray)):
        print("Our input are already np arrays")
        IntArray = IntList
    else:
        print("input is not np array, so we are converting")
        IntArray = np.asarray(IntList)

    # print(IntArray)
    bitFinal = []
    for intValue in IntArray:
        bitsResults = ConvertIntToBits(intValue, base)
        bitFinal.append(bitsResults)

    # bitFinal = np.asarray(bitFinal, dtype=object)
    # print(bitFinal)
    return(bitFinal)

# Convert a matrix of ints to a matrix of bits
def ConvertIntMatrixToBitsMatrix(intMat, base, returnType="bitsList"):
    if isinstance(intMat, (np.ndarray)):
        print("Our input are already np arrays")
        IntArray = intMat
    else:
        print("input is not np array, so we are converting")
        IntArray = np.asarray(intMat)
    ArrayShape = IntArray.shape
    print("The shape of our input is", ArrayShape)
    #return a list with converted bits 
    bitFinal = []
    bitFinalMatrix = np.empty((ArrayShape[0],ArrayShape[1]))
    for i in range(ArrayShape[0]):
        for j in range(ArrayShape[1]):
    # for i in range(2):
    #     for j in range(2):
            # print(IntArray[i, j])
            # print(ConvertIntToBits(IntArray[i, j], base))
            # below return the bits 
 
            # below return the sum 
            ConvertedBits = ConvertIntToBits(IntArray[i, j], base)
            # Return a list with converted bits 
            bitFinal.append(ConvertedBits)
            # Return a matrix with sumed 1s 
            bitFinalMatrix[i, j] = sum(ConvertedBits)
    if returnType == "bitsList":
        rstMatrix = bitFinal
    elif returnType == "NumOnesMatrix":
        rstMatrix = bitFinalMatrix
    return(rstMatrix)

print("An example of ConvertIntListToBitsList: ")
print(ConvertIntListToBitsList([4, 8, 9], 2))
print("An example of ConvertIntMatrixToBitsMatrix: ")
# print(ConvertIntMatrixToBitsMatrix([[4, 8, 9], [2, 3, 1]], 2))
#The problem is how we deal with 0 
print(ConvertIntMatrixToBitsMatrix([[0, 8, 9], [2, 3, 1]], 2, "bitsList"))

#note for base 10, you can use 0-9 to represent number 
#for base 5, you can use 0-5 
#for base 7, you can use 0-6
testBase = 2
test1 = ConvertIntToBits(IntValue=19, base=testBase)
test2 = ConvertIntToBits(IntValue=15, base=testBase)
test3 = ConvertIntToBits(IntValue=50, base=testBase)
test4 = ConvertIntToBits(IntValue=41, base=testBase)
print("test1 is ", test1)
print("test2 is ", test2)
print("test3 is ", test3)
print("test4 is ", test4)

print(ConvertBitsListToIntList([test1, test2, test3], testBase))

print(ConvertIntListToBitsList(IntList=[19, 15, 50], base=testBase))

#See whether it works for the outputCodingAll (it worked, double check)
myConvert = ConvertBitsListToIntList(outputCodingAll, testBase)
onlineCovert = bits_to_int(outputCodingAll)

if myConvert.all() == onlineCovert.all():
    print("My way is the same as the online way")
else:
    print("My way is different from online way")

Finally, a slight modification of Rabeik's answer aims to understand what does his code do, but this runs much slower, i.e., "an alternatively way may be convert the integer representation of XOR into binary representation, and then simply sum them up, but this would require Python to store a large matrix which may then put lot pressure on RAM," :

def find_pairs_by_threshold_fast_v2_branch1(
        coding_all_bits: np.ndarray,
        listLength=30,
        hashDiffThreshold=0,
        slice_size=None
) -> np.ndarray:

    if slice_size is None:
        #https://numpy.org/doc/stable/reference/generated/numpy.ufunc.outer.html
        #np.ufunc.outer means to run the function on all pairs of A and B
        #so below simply means compute the xor betweeen all paris of coding list 
        #just the same as what I have done using for i in range(lenA), for j in range(lenB) etc..
        #bitwise_xor returns the value represented by binary 
        #you could use binary_repr to represent value in binary instead (note for binary_repr it does not have .outer so you may not use pair-wise in this case)
        xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits, coding_all_bits)
        # print("xor_outer_matrix is \n", xor_outer_matrix)
        # print(np.binary_repr(1052745519))
        # print(ConvertIntListToBitsList([1052745519], 2))
        # let's try convert xor_outer_matrix to bits and sum them 
        """
        Convert the difference matrix of XOR into binary represenation and store in a matrix and sum them up might be an alternative way
        but this may require a lot of RAM, but for the purpose of understanding of integers and bits, let's try this appoarch as well. 
        """
        # xor_outer_matrix_bits = bits_to_int(xor_outer_matrix)
        # The reason it does not work in the first place is we haven't dealt with 0 in base=2
        xor_outer_matrix_bits = ConvertIntMatrixToBitsMatrix(xor_outer_matrix, base=2, returnType="NumOnesMatrix")
        same_pairs = np.transpose(np.where(xor_outer_matrix_bits <= hashDiffThreshold))

    else:

        same_pairs_list = []
    
        for slice_starts in range(0, len(coding_all_bits), slice_size):
    
            xor_outer_matrix = np.bitwise_xor.outer(coding_all_bits[slice_starts: slice_starts+slice_size], coding_all_bits)
    
            # counting number of differences
            xor_outer_matrix_bits = ConvertIntMatrixToBitsMatrix(xor_outer_matrix, base=2, returnType="NumOnesMatrix")
            same_pairs = np.transpose(np.where(xor_outer_matrix_bits <= hashDiffThreshold))
    
            same_pairs[:, 0] += slice_starts
    
            same_pairs_list.append(same_pairs)
    
        same_pairs = np.concatenate(same_pairs_list)

    # filtering out diagonal values
    same_pairs = same_pairs[same_pairs[:, 0] != same_pairs[:, 1]]

    # filtering out duplicates above diagonal
    same_pairs.sort(axis=1)
    same_pairs = np.unique(same_pairs, axis=0)

    return same_pairs

Hope this helps.

Fastest way to identify all pairs of lists that their difference is lower than a given threshold when the overall list is very long (10000)

4 Answers4