4

I have a list (let's call it all_my_arrays) that contains about 48,000 1D arrays. I'd like to know how many duplicate arrays are in this list, if any. However, I'd like to exclude empty arrays (because I have multiple empty arrays within the list and don't want those factored into my duplicate count). I tried this code below, but it is taking too long:

import numpy as np
uniques=[]
for arr in all_my_arrays:
    if not np.array_equal(np.array([]), arr):
        if not any(np.array_equal(arr, unique_arr) for unique_arr in uniques):
           uniques.append(arr)
print(len(uniques)) #number of non-duplicates

Is there a much quicker way to accomplish this?

curious_cosmo
  • 1,184
  • 1
  • 18
  • 36
  • Are these arrays all the same size when they are non-empty? – jpp Sep 05 '18 at 23:57
  • Maybe you could sort `all_my_arrays` or a shallow copy of it. Then you only have to compare each array with the previous one to eliminate duplicates. – Michael Butscher Sep 05 '18 at 23:57
  • Yes, the non-empty 1D arrays are all the same size (length of 7). – curious_cosmo Sep 05 '18 at 23:59
  • 2
    *I have a list...* A Python list of numpy arrays? What comprises a duplicate? Is `array([1, 2, 3])` the same as `array([3, 2, 1])`? – dawg Sep 06 '18 at 00:34
  • @dawg 1st question: Yes, see the tags. 2nd question: No, duplicate arrays are those which return `True` to `np.array_equal(array1, array2)`, which requires them to be the same length and that every element `array1[i]` and `array2[i]` is the same for each `i` within the range of length. – curious_cosmo Sep 06 '18 at 00:42
  • @curious_cosmo have you solved this? If so, I think you should mark it as solved so people don't keep posting answers – Henry Woody Sep 06 '18 at 17:23
  • @HenryWoody Not yet, I'm still trying these different methods – curious_cosmo Sep 06 '18 at 17:41

4 Answers4

2

You can use the set type to get the unique values in your list. First you have to convert the arrays to hashable types (tuple here is good). Here's an example:

uniques = set(tuple(arr) for arr in all_my_arrays if arr.size > 0)

The set uniques will contain all the unique, non-empty arrays from your original all_my_arrays list. The contents of uniques are tuples, but you can convert them back to arrays with a list comprehension. If you're only interested in the number of unique arrays, then you can just call len(uniques) and not worry about converting back to arrays.

This approach has time complexity O(n + m) where n is the number of arrays and m is the length of each. There is however the overhead of converting to tuples, but I believe this method should be faster than what you have so far (which has time complexity O(n^2)) especially for such a large number of arrays.

Edit: To make this a bit faster, you can remove the empty check on each element and then just handle that at the end. Here's what that would look like:

uniques = set(tuple(arr) for arr in all_my_arrays)
num_unique = len(uniques) if () not in uniques else len(uniques) - 1
Henry Woody
  • 14,024
  • 7
  • 39
  • 56
  • Thanks, this ran very fast! Does this method preserve order (not required for the purposes of my question, just wondering)? – curious_cosmo Sep 06 '18 at 00:35
  • @curious_cosmo a set is an unordered collection, so it does not preserve order. You could check out [collections.OrderedDict](https://docs.python.org/2/library/collections.html#collections.OrderedDict) (putting tuples as the keys) or [this implementation of OrderedSet](https://code.activestate.com/recipes/576694/) if you're interested in preserving order. – Henry Woody Sep 06 '18 at 00:56
  • @Alexander the example you've just pointed out (`set(np.matrix([1,2,1,2]))`) is quite different from the solution I have posted. Yours converts the matrix `[1 2 1 2]` into a set `{1, 2}`, while mine converts a comprehension of tuples into a set. So yours is one level beneath what I have described in my answer. A more appropriate example would be `set([np.matrix([1,2,1,2]), np.matrix([1,2,1,3])])`, which produces `{(1,2,1,2), (1,2,1,3)}`, which is actually in line with the question. – Henry Woody Sep 06 '18 at 01:06
  • You are correct. Sorry for the confusion and thanks for the detailed explanation. – Alexander Sep 06 '18 at 01:08
1

Just do

in_arr = np.array([i for i in all_my_arrays if i.size == 7])
uniques = np.unique(in_arr, axis = 0)
uniques_list = list(uniques)  # if you really want a list

EDIT: Beware that np.unique sorts internally, so order is not preserved. If you want to maintain order you'll probably need to build a speciality numba function.

Daniel F
  • 13,620
  • 2
  • 29
  • 55
0

Here's a NumPy-based solution which may work for you. For simplicity, I'll use an example where arrays are either of length 3 or 0.

Note that count of duplicates as per your definition equals total number of arrays minus number of empty arrays minus number of unique non-empty arrays.

The last part may be most expensive, but we can use np.unique on a regular NumPy array for this:

L = [np.array([1, 2, 3]), np.array([]), np.array([3, 2, 1]),
     np.array([1, 3, 2]), np.array([1, 2, 3]), np.array([1, 3, 2]),
     np.array([]), np.array([]), np.array([1, 2, 3])]

empty_arr = {idx for idx, arr in enumerate(L) if arr.shape == (0,)}
empty_count = len(empty_arr)

A = np.array([arr for idx, arr in enumerate(L) if idx not in empty_arr])

unique_arr = np.unique(A, axis=0)

duplicates = len(L) - len(empty_arr) - len(unique_arr)

print(duplicates)  # 3
jpp
  • 159,742
  • 34
  • 281
  • 339
0

I am not sure that I am understanding your problem fully.

However, if I understand:

  1. You have a Python list of 7 element numpy arrays;
  2. You want to filter out the arrays that are all 0;
  3. You then want to get a unique count of what is left.

If that is correct, you can do this:

import numpy as np
li_of_arr=[np.random.choice(4,7) for _ in range(50000)] # example list of arrays
mat=np.array([e for e in li_of_arr if np.any(e)])   # create a numpy matrix of non zero arrays
uniq=np.unique(mat,axis=0)    # here are your unique sub arrays
print (len(mat), len(uniq))   # len of non zero elements and unique non zero elements

This takes less than 1 second on my MacBook.

dawg
  • 98,345
  • 23
  • 131
  • 206