Count the number of exact co-occurences of items in a numpy list

Question

I am trying to figure out the fastest way to count how many time two values are located one after the other in a numpy list.

For example:

list = [1, 5, 4, 1, 2, 4, 6, 7, 2, 1, 3, 3, 1, 2] and I want to count the number of times the value 1 follows the value 2 (but not vice versa)

In the example above, the answer should be 1 since 1 follows 2 only once.

I can obviously reach the answer with a simple for-loop that adds to a counter every time the item i is equal 1 and item i-1 equals 2, but I feel that there must be a faster way to do it,

Thanks

Note: don't use `list` as a variable name, as it is already a built in `dtype` in python — sacuL, Jul 05 '18 at 14:21
Related (more general): [Python/NumPy first occurrence of subarray](https://stackoverflow.com/questions/7100242/python-numpy-first-occurrence-of-subarray) — miradulo, Jul 05 '18 at 15:26
Your example is a bit unclear - do you have a NumPy array you are working with or a Python list that you think you can perform this operation faster on using NumPy? There's no such thing as "numpy lists". — miradulo, Jul 05 '18 at 15:40

sacuL · Accepted Answer · 2018-07-05T15:32:38.003

5

You coud do this using np.diff and np.where:

import numpy as np

mylist = [1, 5, 4, 1, 2, 4, 6, 7, 2, 1, 3, 3, 1, 2]

# Turn your list into a numpy array
myarray = np.array(mylist)

# find occurences where myarray is 2 and the following element is 2 minus 1
np.sum((myarray[:-1] == 2) & (np.diff(myarray) == -1))

Which returns 1

Timings on a large array:

On a small list, the time difference between an iterative method and numpy methods will not be noticeable. But on a large array, as in the example below, the performance of numpy is much better.

import timeit

mylist = np.random.choice(range(0,9), 1000000)

def np_method(mylist = mylist):
    return np.sum((mylist[:-1] == 2) & (np.diff(mylist) == -1))

def zip_loop(a = mylist):
    return len( [1 for i,j in zip(a, a[1:]) if i == 2 and j == 1] )

def for_loop(list1 = mylist):
    count=0
    desired_num=2
    follower_num=1
    for i in range(len(list1)-1):
        if list1[i]==desired_num:
            if list1[i+1]==follower_num:
                count+=1
    return count

>>> timeit.timeit(np_method, number = 100) / 100
0.006748438189970329

>>> timeit.timeit(zip_loop, number = 100) / 100
0.3811768989200209

>>> timeit.timeit(for_loop, number = 100) / 100
0.3774999916599336

edited Jul 05 '18 at 15:32

answered Jul 05 '18 at 14:20

sacuL

49,704
8
81
106

You can just use `sum` instead of `len(np.where(...))` if all you need is the number of occurences. – busybear Jul 05 '18 at 14:28
Thanks @busybear, yeah, that's nicer – sacuL Jul 05 '18 at 14:31
`np.sum((mylist[:-1] == 2) & (np.diff(mylist) == -1))` returns `0` no matter what input. – lenik Jul 05 '18 at 14:54
`mylist[:-1] == 2` needs to be changed to `np.array(mylist[:-1]) == 2` (which was in earlier edits) – scrpy Jul 05 '18 at 14:54
@scrpy, True... sorry about that! – sacuL Jul 05 '18 at 14:57
@lenik, fixed now, sorry :/ – sacuL Jul 05 '18 at 14:59
@sacul and your comment about timing is also not true, see below for timing results, `numpy` and `zip()` approaches work almost the same time, while `for` loop is almost twice as fast... – lenik Jul 05 '18 at 15:03
@sacul I have put the timing code in a separate answer, use it, run it, ponder over it. `numpy` is slow =) – lenik Jul 05 '18 at 15:10
@sacul you're missing `np.array(my_list)` from your answer in your timing code, what a cheat =( – lenik Jul 05 '18 at 15:14
1

@lenik They tagged the question with NumPy and say they have a "NumPy list". It seems reasonable to assume they intend on working with NumPy arrays. – miradulo Jul 05 '18 at 15:15
1

np.sum((mylist[:-1] == 2) & (mylist[:1] == 1)) is two times faster. – B. M. Jul 05 '18 at 15:23

Inder · Answer 2 · 2018-07-05T14:35:55.443

1

Thea easiest way i can think of is to use a for loop

count=0
desired_num=2
follower_num=1
for i in range(len(list1)-1):
    if list1[i]==desired_num:
        if list1[i+1]==follower_num:
            count+=1
print("total occurance=",count)

takes : 0.0003437995910644531s on my machine

edited Jul 05 '18 at 14:35

answered Jul 05 '18 at 14:30

Inder

3,711
9
27
42

This would work quickly on a small array, but on my machine, this took about 0.3618 seconds for a large array (1,000,000 elements) – sacuL Jul 05 '18 at 14:36
lozz I have i 7 16 GB ram your specs please?? – Inder Jul 05 '18 at 14:37
1

i5 8GB ram. My machine is getting kind of old, but looping through a large array will usually take significantly longer than `numpy` vectorized methods – sacuL Jul 05 '18 at 14:39
@Inder nice, twice as fast as the accepted answer. upvoted. – lenik Jul 05 '18 at 15:36

Laurent H. · Answer 3 · 2018-07-05T14:31:24.013

0

I may suggest you to use slicing and comprehension to iterate through your input list as follows:

myList = [1, 5, 4, 1, 2, 4, 6, 7, 2, 1, 3, 3, 1, 2]

result = sum(myList[i:i+2] == [2,1] for i in range(len(myList)-1))

print(result) # 1

Using the zip() function can also help you:

myList = [1, 5, 4, 1, 2, 4, 6, 7, 2, 1, 3, 3, 1, 2]

result = sum((i,j) == (2,1) for (i,j) in zip(myList, myList[1:]))

print(result) # 1

edited Jul 05 '18 at 14:31

answered Jul 05 '18 at 14:26

Laurent H.

6,316
1
18
40

you're generating very long lists with mostly zeros inside, this is not very efficient waste of memory. see below for the method to generate smaller lists, that contain only one element for every occasion we're looking for. – lenik Jul 05 '18 at 14:36

score -1 · Answer 4 · answered Jul 05 '18 at 14:24

You should not call your variable list -- it's already used in python and very confusing.

>>> a = [1, 5, 4, 1, 2, 4, 6, 7, 2, 1, 3, 3, 1, 2]
>>> len( [1 for i,j in zip(a, a[1:]) if i == 2 and j == 1] )
1

Basically, you may put your array over itself with zip() and deal with the number pairs, looking for any combinations:

>>> zip(a, a[1:])
[(1, 5), (5, 4), (4, 1), (1, 2), (2, 4), (4, 6), (6, 7), (7, 2), (2, 1), (1, 3), (3, 3), (3, 1), (1, 2)]

score -1 · Answer 5 · answered Jul 05 '18 at 15:00

-1

Just for the fun of it, I have timed all 4 main solutions, here are the results:

#!/usr/bin/env python

import numpy as np
import random

def f1(li):
    return np.sum((np.array(li[:-1]) == 2) & (np.diff(li) == -1))

def f2(li):
    return sum((i,j) == (2,1) for (i,j) in zip(li, li[1:]))

def f3(li):
    count=0
    desired_num=2
    follower_num=1
    for i in range(len(li)-1):
        if li[i]==desired_num:
        if li[i+1]==follower_num:
            count+=1
    return count    

def f4(li) :
    return len( [1 for i,j in zip(li, li[1:]) if i == 2 and j == 1] )

if __name__=='__main__':
    import timeit   
    import random
    s = []
    for i in range(10000000) :
        s.append( random.randint(1,10) )

    print f1(s), f2(s), f3(s), f4(s)

    print(f1(s)==f2(s)==f3(s)==f4(s))

    for f in (f1,f2,f3,f4):
        print("   {:^10s}{:.4f} secs".format(f.__name__, timeit.timeit("f(s)", setup="from __main__ import f, s", number=10)))

'''
output:

100236 100236 100236 100236
True
       f1    7.2285 secs
       f2    13.7680 secs
       f3    4.3167 secs
       f4    7.7375 secs

'''

Surprisingly, simple for loop beats numpy =)

answered Jul 05 '18 at 15:00

lenik

23,228
4
34
43

@sacul 10000000 is a tiny number ? – lenik Jul 05 '18 at 15:15
fyi, I'm not the downvoter (on either of your answers). You can be skeptical, that's fine. Let's agree to disagree :). OP can choose whichever answer they agree with. Sorry for the frustration – sacuL Jul 05 '18 at 15:23
@sacul i don't care about downvoters. i hoped your solution really works as fast as you say. unfortunately it loses to a simple `for` loop and about as fast as `zip()` with list comprehensions. what a pity. – lenik Jul 05 '18 at 15:27
1

@lenik You are benchmarking the creation of a massive NumPy array from a list. This is not _at all_ productive. I agree it is unclear from their question whether they are really using NumPy or not, but benchmarking the creation of NumPy arrays doesn't have any significance. – miradulo Jul 05 '18 at 15:29
@miradulo i use the code from the answers verbatim. zipping large lists is not a free lunch either =) – lenik Jul 05 '18 at 15:33

Count the number of exact co-occurences of items in a numpy list

5 Answers5