-1

I have a tall (3,000,000 by 2) matrix, represented as a list of lists (A list with 3 million elements, each being a list with two elements) and I need to count the number of times each pair appears as a row (there's a finite number of possible pairs, around 5000). This is what I do so far, but it's highly inefficient:

for a in list1:
    for b in list2:
        count_here = tall_matrix.count([a,b])

Any ideas on how to make this quicker?

Thanks a lot!

  • You might not have access to the construction step, but a nice way is to keep the count when putting the elements into the container. – Right leg Jul 21 '17 at 12:35
  • could you add a small example of your data? – MattR Jul 21 '17 at 12:36
  • A hash will be a better option with key such as a1_a2 => count. a1 should be smaller than a2. Once done you are done, it is easy to use this to get counts. But to make this hash, it will still take O(n) time. – kawadhiya21 Jul 21 '17 at 12:37
  • Possible duplicate of [How can I count the occurrences of a list item in Python?](https://stackoverflow.com/questions/2600191/how-can-i-count-the-occurrences-of-a-list-item-in-python) – Mohd Jul 21 '17 at 12:46

2 Answers2

1

This is damned simple using collections.Counter. Since your list contains sub-lists, and sub-lists are not hashable, you'll need to convert them to tuples first:

In [280]: x = [[1, 2], [1, 2], [3, 4], [4, 5], [5, 6], [4, 5]]

In [282]: c = collections.Counter(map(tuple, x))

In [283]: c
Out[283]: Counter({(1, 2): 2, (3, 4): 1, (4, 5): 2, (5, 6): 1})

c stores the counts of every single pair in your list.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    @tobias_k A counter is a defaultdict of sorts: `In [284]: c[(100, 20352)]; Out[284]: 0` – cs95 Jul 21 '17 at 12:41
  • 1
    Well, in this case I learned something new today :-). Still, might add a hint w.r.t. converting the sub-list to tuple *for lookup*, as well. – tobias_k Jul 21 '17 at 12:44
  • @tobias_k Already mentioned it man :) `Since lists are not hashable, you'll need to convert them to tuples first` – cs95 Jul 21 '17 at 12:44
0

Counter should do the trick :

Test for performance (using IPython) :

In [1]: import random
In [2]: a=[(random.randint(0, 10), random.randint(0, 10)) for i in range(3000000)]
In [3]: from collections import Counter
In [4]: %time c = Counter(a)
CPU times: user 940 ms, sys: 52 ms, total: 992 ms
Wall time: 891 ms
Raphaël Braud
  • 1,489
  • 10
  • 14