Breaking down the original problem to three parts: "(1) My arrays are really large, and (2) I have a lot of them, and (3) the probability of two arrays being equal is small"
All the solutions (to date) are focused on part (1) - optimizing the performance of each equality check, and some improve this performance by factor of 10. Points (2) and (3) are ignored. Comparing each pair has O(n^2) complexity, which can become huge for a lot of matrices, while needles as the probability of being duplicates is very small.
The check can become much faster with the following general algorithm -
- fast hash of each array O(n)
- check equality only for arrays with the same hash
A good hash is almost unique, so the number of keys can easily be a very large fraction of n. On average, number of arrays with the same hash will be very small, and almost 1 in some cases. Duplicate arrays will have the same hash, while having the same hash doesn't guarantee they are duplicates. In that sense, the algorithm will catch all the duplicates. Comparing images only with the same hash significantly reduces the number of comparisons, which becomes almost O(n)
For my problem, I had to check duplicates within ~1 million integer arrays, each with 10k elements. Optimizing only the array equality check (with @MB-F solution) estimated run time was 5 days. With hashing first it finished in minutes. (I used array sum as the hash, that was suited for my arrays characteristics)
Some psuedo-python code
def fast_hash(arr) -> int:
pass
def arrays_equal(arr1, arr2) -> bool:
pass
def make_hash_dict(array_stack, hush_fn=np.sum):
hash_dict = defaultdict(list)
hashes = np.squeeze(np.apply_over_axes(hush_fn, array_stack, range(1, array_stack.ndim)))
for idx, hash_val in enumerate(hashes):
hash_dict[hash_val].append(idx)
return hash_dict
def get_duplicate_sets(hash_dict, array_stack):
duplicate_sets = []
for hash_key, ind_list in hash_dict.items():
if len(ind_list) == 1:
continue
all_duplicates = []
for idx1 in range(len(ind_list)):
v1 = ind_list[idx1]
if v1 in all_duplicates:
continue
arr1 = array_stack[v1]
curr_duplicates = []
for idx2 in range(idx1+1, len(ind_list)):
v2 = ind_list[idx2]
arr2 = array_stack[v2]
if arrays_equal(arr1, arr2):
if len(curr_duplicates) == 0:
curr_duplicates.append(v1)
curr_duplicates.append(v2)
if len(curr_duplicates) > 0:
all_duplicates.extend(curr_duplicates)
duplicate_sets.append(curr_duplicates)
return duplicate_sets
The variable duplicate_sets
is a list of lists, each internal list contains indices of all the same duplicates.