How can I find the number of overlapping sequences in a String with Python?

Question

I have a long sequence, and I would like to know how often some sub-sequences occur in this sequence.

I know string.count(s, sub), but it only counts non-overlapping sequences.

Does a similar function which also counts overlapping sequences exist?

You might want to look at http://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches — Matthew Schinckel, Jul 27 '11 at 12:18
@moose Do you know the subsequences before to find their occurence's counts ? — eyquem, Jul 27 '11 at 13:35

score 10 · Accepted Answer · answered Jul 27 '11 at 13:03

10

As an alternative to writing your own search function, you could use the re module:

In [22]: import re

In [23]: haystack = 'abababa baba alibababa'

In [24]: needle = 'baba'

In [25]: matches = re.finditer(r'(?=(%s))' % re.escape(needle), haystack)

In [26]: print [m.start(1) for m in matches]
[1, 3, 8, 16, 18]

The above prints out the starting positions of all (potentially overlapping) matches.

If all you need is the count, the following should do the trick:

In [27]: len(re.findall(r'(?=(%s))' % re.escape(needle), haystack))
Out[27]: 5

answered Jul 27 '11 at 13:03

NPE

486,780
108
951
1,012

I'll vote for this one. It's much better than my own. A simple len(matches) will give you the result. – Bite code Jul 27 '11 at 13:06
one liner : return len( re.finditer(r'(?=(%s))' % re.escape(needle), haystack)) ... superb :) – DhruvPathak Jul 27 '11 at 13:21
@DhruvPathak ``TypeError: object of type 'callable-iterator' has no len()`` – eyquem Jul 27 '11 at 13:57

Bite code · Answer 2 · 2011-07-27T13:01:58.570

6

A simple to understand way to do it is:

def count(sub, string):
    count = 0
    for i in xrange(len(string)):
        if string[i:].startswith(sub):
            count += 1
    return count

count('baba', 'abababa baba alibababa')
#output: 5

If you like short snippets, you can make it less readable but smarter:

def count(subs, s):
    return sum((s[i:].startswith(subs) for i in xrange(len(s))))

This uses the fact that Python can treat boolean like integers.

edited Jul 27 '11 at 13:01

answered Jul 27 '11 at 12:54

Bite code

578,959
113
301
329

With ``Lsub = len(sub)`` and ``if string[i:i+Lsub]==sub:`` and a string of length 298, it runs in 1.17 seconds while your code runs in 4 seconds (10000 turns of iteration) - Also, **sum()** is slow and the function with **sum()** is certainly slower (I didn't test though) – eyquem Jul 27 '11 at 14:11
My first version used Lsub, but I sacrificed speed for confort with this new version as the OP has never stated he had a perf requirement. I never profiled `sum` so I can't swear it's slow (I can't see a reason why it would be), but if it is, itertools.reduce + operator.add would do the job just fine. – Bite code Jul 27 '11 at 14:24
I've learnt the slowness of **sum()** from the member _gnibbler_ in this thread: (http://stackoverflow.com/questions/3083829/how-do-i-interleave-strings-in-python/3083894#comment-8050679) (look in the comments). As you will see in these comments, it is the same reason that explains the slowness of **sum()** AND **reduce()** too – eyquem Jul 27 '11 at 14:40
That was my first assumption. But itertools.reduce() is not the same as reduce(): it doesn't have quadratic performance because it doesn't create a new list every time. All the functions in the itertools module use generators behind the scene. – Bite code Jul 28 '11 at 07:14
ahem... the problem is that there is no **reduce()** in the _itertools_ module in the Python 2.7.1 doc , nor in Python 3.2 doc. In Python 3.2, **reduce()** is present in the **functools()** module only and it seems to have the same functionning as before in Python 2.x that is to say ``For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5).`` In Python 2.x , **functools.reduce()** : _This is the same function as reduce(). It is made available in this module to allow writing code more forward-compatible with Python 3._ – eyquem Jul 29 '11 at 09:27
And even more: I made some benchmarks. Sum is indeed slower than a manual loop. Even using reduce and operator.add is. So +1 for you man. – Bite code Jul 29 '11 at 10:02
Python is a vast realm. The +1 is for _gnibbler_ that made me aware of this slowness of **sum()** and **reduce()**. I wonder why **sum()** is so slow: maybe its underlying implementation is written in Python, not in C (?) – eyquem Jul 29 '11 at 10:29
@e-satis I would really like to see those benchmarks. There is no way sum of numbers (or bools, which _are_ numbers:) would be slower than a manual loop. (@gnibbler was talking about sum of _lists_, and there they are completely right, of course.) – Veky Feb 05 '16 at 03:07

score 1 · Answer 3 · answered Jul 27 '11 at 12:21

1

This should help you :

matches =[]
st = 'abababa baba alibababa'
needle = 'baba'
for i in xrange(len(st)-len(needle)+1): 
   i = st.find(needle,i,i+len(needle))
   if(i >= 0):
     matches.append(st.find(needle,i,i+len(needle)))
print(str(matches))

see it here : http://codepad.org/pmkKXmWB

Did not benchmark it for long strings, see if its efficient enough for your use.

answered Jul 27 '11 at 12:21

DhruvPathak

42,059
16
116
175

Would there be an advantage to replacing the lines of the loop with ... new_st = st[i:]; if new_st.startswith(needle): matches.append(i) – jcfollower Jul 27 '11 at 12:46
Nope I guess, since i is already being used to check only particular chunk of the string st.find(needle,i,i+len(needle)) starting from i, so creating substring and assigning would be just an overhead . – DhruvPathak Jul 27 '11 at 12:48

score 0 · Answer 4 · answered Sep 08 '14 at 10:30

I learnt today that you could use a running index to fetch the next occurrence of your substring:

string = 'bobobobobobobob' # long string or variable here
count = 0
start = 0
while True:
    index = string.find('bob', start)
    if index >= 0:
        count += 1
        start += 1
    else:
        break
print(count)

Returns 7

How can I find the number of overlapping sequences in a String with Python?

4 Answers4

Linked

Related