
I have a list of around 131,000 arrays, each of length 300, in Python. I want to check which of the arrays repeat in this list. I am trying to do this by comparing each array with all the others, like:

import numpy as np

wordEmbeddings = [[0.8, 0.4, ...], ...]  # 300 elements per array, ~131000 arrays
count = 0
for i in range(len(wordEmbeddings)):
   for j in range(len(wordEmbeddings)):
      if i != j:
         if np.array_equal(wordEmbeddings[i], wordEmbeddings[j]):
            count += 1

This runs very slowly; it might take hours to finish. How can I do this efficiently?

Harsh2093

5 Answers


You can use collections.Counter to count the frequency of each sublist:

>>> from collections import Counter
>>> Counter(list(map(tuple, wordEmbeddings)))

We need to cast each sublist to a tuple, since lists are unhashable, i.e. they cannot be used as dict keys.

This will give you a result like this:

Counter({(...4, 5, 6...): 1, (...1, 2, 3...): 1})

Each key of the Counter object is a tuple and its value is the number of times that tuple occurs. Next, you can filter the resulting Counter to yield only the elements whose count is > 1:

>>> items = Counter(list(map(tuple, wordEmbeddings)))
>>> list(filter(lambda x: items[x] > 1, items))
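
A slightly more direct variant (a sketch along the same lines, not part of the original answer) iterates over the key/count pairs, which avoids the extra dict lookup inside the lambda:

>>> items = Counter(map(tuple, wordEmbeddings))
>>> [t for t, c in items.items() if c > 1]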

Timeit results:

$ python -m timeit -s "a = [range(300) for _ in range(131000)]" -s "from collections import Counter" "Counter(list(map(tuple, a)))"
10 loops, best of 3: 1.18 sec per loop
Sohaib Farooqi

You can remove duplicate comparisons by starting the inner loop just past the outer index, so each pair is compared only once:

for i in range(len(wordEmbeddings)):
    for j in range(i + 1, len(wordEmbeddings)):

You could look into PyPy for general-purpose speed-ups.
It might also be worth looking into hashing the arrays somehow.

Here's a question on speeding up numpy array comparison. Does the order of the elements matter to you?
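
To sketch the hashing idea (my illustration, not from the answer; it assumes each row is, or can be converted to, a numpy array, whose raw bytes serve as a hashable key):

import numpy as np
from collections import Counter

# Identical arrays produce identical byte strings, so counting the
# byte representations finds exact duplicates.
counts = Counter(np.asarray(row).tobytes() for row in wordEmbeddings)
num_duplicate_rows = sum(c - 1 for c in counts.values() if c > 1)

Note this relies on bit-for-bit equality of the floats (e.g. 0.0 and -0.0 hash differently).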

ConorSheehan1

You can use set and tuple to find the duplicated arrays. Create a new list containing tuples; we use tuples because lists are unhashable. Then filter the new list using a set:

tuples = list(map(tuple, wordEmbeddings))
duplications = set(t for t in tuples if tuples.count(t) > 1)
print(duplications)

Note that tuples.count(t) scans the whole list for every element, so this is still quadratic in the number of arrays; see the one-pass sketch below.
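
A one-pass variant of the same idea (a sketch, not part of the original answer) keeps a set of tuples already seen, avoiding the repeated count() scans:

seen = set()
duplications = set()
for t in map(tuple, wordEmbeddings):
    if t in seen:  # second or later occurrence
        duplications.add(t)
    else:
        seen.add(t)
print(duplications)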
abdullahselek

Maybe you can reduce the initial list to unique hashes, or to non-unique sums, and go over the hashes first; that may be a faster way to compare elements.
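
A sketch of the sum prefilter (my illustration of the suggestion above): arrays with different sums cannot be equal, so full comparisons are only needed within groups that share a sum.

import numpy as np
from collections import defaultdict

# Group row indices by the (non-unique) sum of each row.
groups = defaultdict(list)
for idx, row in enumerate(wordEmbeddings):
    groups[round(sum(row), 12)].append(idx)  # rounding guards against float noise

# Only rows in the same group can possibly be duplicates.
count = 0
for indices in groups.values():
    for pos, i in enumerate(indices):
        for j in indices[pos + 1:]:
            if np.array_equal(wordEmbeddings[i], wordEmbeddings[j]):
                count += 1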

Evgeny

I suggest you first sort the list (which might also be helpful for further processing) and then compare. The advantage is that you only need to compare each array to the previous one:

import numpy as np
from functools import cmp_to_key

wordEmbeddings = [[0.8, 0.4, 0.3, 0.2], [0.2, 0.3, 0.7], [0.8, 0.4, 0.3, 0.2], [1.0, 3.0, 4.0, 5.0]]

def smaller(x, y):
    # Lexicographic comparison; returning 1 when x < y sorts in
    # descending order, which is fine since we only need equal
    # arrays to end up adjacent.
    for i in range(min(len(x), len(y))):
        if x[i] < y[i]:
            return 1
        elif y[i] < x[i]:
            return -1
    if len(x) > len(y):
        return 1
    else:
        return -1

wordEmbeddings = sorted(wordEmbeddings, key=cmp_to_key(smaller))
print(wordEmbeddings)
# output: [[1.0, 3.0, 4.0, 5.0], [0.8, 0.4, 0.3, 0.2], [0.8, 0.4, 0.3, 0.2], [0.2, 0.3, 0.7]]

count = 0
for i in range(1, len(wordEmbeddings)):
    if np.array_equal(wordEmbeddings[i], wordEmbeddings[i - 1]):
        count += 1

print(count)
# output: 1

If N is the number of word embeddings and n is the length of each inner array, then your approach does O(N*N*n) element comparisons. Even after halving the comparisons as in the answer above, you still have O(N*N*n/2) comparisons.

Sorting takes O(N*log(N)*n) time and the subsequent counting step only takes O(N*n) time, which all in all is much less than O(N*N*n/2).
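
For completeness, the same sort-and-scan can be done entirely in numpy (my addition, not part of the original answer; it assumes all arrays have the same length, as in the question, so they stack into a 2D array):

import numpy as np

embeddings = np.asarray(wordEmbeddings)  # shape (131000, 300), assuming equal-length rows
# np.unique with axis=0 sorts the rows internally and reports how
# often each distinct row occurs.
unique_rows, counts = np.unique(embeddings, axis=0, return_counts=True)
duplicate_rows = unique_rows[counts > 1]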

FlyingTeller