String count with overlapping occurrences

Question

What's the best way to count the number of occurrences of a given string, including overlapping ones, in Python? This is one way:

def function(string, str_to_search_for):
    count = 0
    for x in range(len(string) - len(str_to_search_for) + 1):
        if string[x:x+len(str_to_search_for)] == str_to_search_for:
            count += 1
    return count


function('1011101111','11')

This method returns 5.

Is there a better way in Python?

The canonical is [Count number of occurrences of a substring in a string](/questions/8899905), which covers how to count both overlapping and non-overlapping occurrences. — Karl Knechtel, Jan 26 '23 at 18:28

Jochen Ritzel · Accepted Answer · 2012-04-21T19:50:18.403

94

Well, this might be faster since it does the comparing in C:

def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count+=1
        else:
            return count

edited Apr 21 '12 at 19:50

answered Jun 03 '10 at 23:35

Jochen Ritzel


score 55 · Answer 2 · edited Nov 20 '22 at 18:57


>>> import re
>>> text = '1011101111'
>>> len(re.findall('(?=11)', text))
5

If you didn't want to load the whole list of matches into memory (which would never be a problem!), you could do this if you really wanted:

>>> sum(1 for _ in re.finditer('(?=11)', text))
5

As a function (re.escape makes sure the substring doesn't interfere with the regex):

def occurrences(text, sub):
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))

>>> occurrences(text, '11')
5
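
To see why the lookahead pattern counts overlaps, it can help to look at where the matches start. The sketch below is only an illustration (the helper name `overlap_positions` is made up here, not part of the answer). It relies on `(?=...)` being zero-width: each match consumes no characters, so the scan resumes one position later instead of skipping past the match.

```python
import re

def overlap_positions(text, sub):
    """Return the start index of every (possibly overlapping) occurrence of sub."""
    # re.escape keeps characters such as '.' or '+' in sub literal.
    pattern = re.compile('(?={0})'.format(re.escape(sub)))
    return [m.start() for m in pattern.finditer(text)]

print(overlap_positions('1011101111', '11'))       # [2, 3, 6, 7, 8]
print(len(overlap_positions('1011101111', '11')))  # 5
```

The length of the returned list is the same count that `occurrences` gives above.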

edited Nov 20 '22 at 18:57

wjandrea


answered Jul 29 '12 at 02:04

jamylak


Could you clarify why it would never be a problem? In practice, if there were a lot of matches or the matched substrings were very large, it could cause a lot of memory usage, no? – wjandrea Nov 20 '22 at 19:00
1

@wjandrea I think maybe I should have said "which would probably never be a problem" because I provided both the `iter` and non `iter` solution. But in this case the text is already in memory so I thought that getting the list would be fine – jamylak Nov 21 '22 at 00:05

score 22 · Answer 3 · answered Aug 25 '16 at 22:50


You can also try using the new Python regex module, which supports overlapping matches.

import regex as re

def count_overlapping(text, search_for):
    return len(re.findall(search_for, text, overlapped=True))

count_overlapping('1011101111','11')  # 5

answered Aug 25 '16 at 22:50

David C


`import regex as re`? Isn't that confusing? Why not just `import regex`? – wjandrea Nov 20 '22 at 19:01
1

There's nothing wrong with `import regex` -- documentation shows that approach. `regex`, however, has all the same components as the standard library `re`, so I prefer writing `re.compile`, etc. in way that is familiar and concise. Also, most of the time I end up using `regex` I started with `re` and then found some use case I want to rely on `regex`. I can then update `import re` to `import regex as re` at the top of the file and not have to make other changes. – David C Nov 25 '22 at 14:31

score 14 · Answer 4 · edited May 23 '17 at 12:26


Python's str.count counts non-overlapping substrings:

In [3]: "ababa".count("aba")
Out[3]: 1

Here are a few ways to count overlapping sequences; I'm sure there are many more :)

Look-ahead regular expressions

How to find overlapping matches with a regexp?

In [10]: re.findall("a(?=ba)", "ababa")
Out[10]: ['a', 'a']

Generate all substrings

In [11]: data = "ababa"
In [17]: sum(1 for i in range(len(data)) if data.startswith("aba", i))
Out[17]: 2

edited May 23 '17 at 12:26

Community


answered Jan 14 '16 at 08:50

Dima Tisnek


3

More concise `sum(data.startswith("aba", i) for i, _ in enumerate(data))` :) – Dima Tisnek Jun 08 '20 at 00:35

score 4 · Answer 5 · edited Nov 20 '18 at 05:51


def count_substring(string, sub_string):
    count = 0
    for pos in range(len(string)):
        if string[pos:].startswith(sub_string):
            count += 1
    return count

This could be the easiest way.

edited Nov 20 '18 at 05:51

Graham


answered May 12 '18 at 04:01

Arun Tom


score 4 · Answer 6 · edited Oct 24 '20 at 11:41

A fairly Pythonic way would be to use a list comprehension here, although it probably wouldn't be the most efficient.

sequence = 'abaaadcaaaa'
substr = 'aa'

counts = sum([
    sequence.startswith(substr, i) for i in range(len(sequence))
])
print(counts)  # 5

The list would be [False, False, True, True, False, False, False, True, True, True, False] as it checks every index in the string, and because int(True) == 1, sum gives us the total number of matches.
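
The same idea also works without building the intermediate list: a generator expression can be passed straight to `sum`, and the loop can stop at the last index where a full match still fits. A minimal sketch, with the function name `count_overlapping_sum` chosen only for illustration:

```python
def count_overlapping_sum(sequence, substr):
    """Count overlapping occurrences by summing booleans (True counts as 1)."""
    last_start = len(sequence) - len(substr)  # last index where substr can still fit
    # range() is empty when substr is longer than sequence, so the sum is 0.
    return sum(sequence.startswith(substr, i) for i in range(last_start + 1))

print(count_overlapping_sum('abaaadcaaaa', 'aa'))  # 5
```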

score 3 · Answer 7 · answered Mar 07 '15 at 13:49


s = "bobobob"
sub = "bob"
ln = len(sub)
print(sum(sub == s[i:i+ln] for i in range(len(s)-(ln-1))))

answered Mar 07 '15 at 13:49

Padraic Cunningham


score 3 · Answer 8 · edited Jun 20 '20 at 09:12

How to find a pattern in another string with overlapping

This function (another solution!) receives a pattern and a text. It returns a list of all the matching substrings found in the text, together with their positions.

import re

def occurrences(pattern, text):
    """
    input: search a pattern (regular expression) in a text
    returns: a list of substrings and their positions 
    """
    p = re.compile('(?=({0}))'.format(pattern))
    matches = re.finditer(p, text)
    return [(match.group(1), match.start()) for match in matches]

print (occurrences('ana', 'banana'))
print (occurrences('.ana', 'Banana-fana fo-fana'))

[('ana', 1), ('ana', 3)]
[('Bana', 0), ('nana', 2), ('fana', 7), ('fana', 15)]

score 2 · Answer 9 · answered Jan 22 '15 at 22:30


My answer, to the bob question on the course:

s = 'azcbobobegghaklbob'
total = 0
for i in range(len(s)-2):
    if s[i:i+3] == 'bob':
        total += 1
print('number of times bob occurs is:', total)

answered Jan 22 '15 at 22:30

Luke D


score 1 · Answer 10 · answered Sep 02 '15 at 13:22

Here is my edX MIT "find bob"* solution (*find the number of "bob" occurrences in a string named s), which basically counts overlapping occurrences of a given substring:

s = 'azcbobobegghakl'
count = 0

while 'bob' in s:
    count += 1 
    s = s[(s.find('bob') + 2):]

print("Number of times bob occurs is: {}".format(count))

stevoblevo · Answer 11 · 2016-10-29T08:51:32.883

1

An alternative very close to the accepted answer but using while as the if test instead of including if inside the loop:

def countSubstr(string, sub):
    count = 0
    while sub in string:
        count += 1
        string = string[string.find(sub) + 1:]
    return count

This avoids while True: and is a little cleaner in my opinion

edited Oct 29 '16 at 08:51

answered Oct 29 '16 at 07:45

stevoblevo


score 1 · Answer 12 · answered Apr 10 '17 at 12:51


If the strings are large, you want to use Rabin-Karp. In summary:

a rolling window of substring size, moving over a string
a hash with O(1) overhead for adding and removing (i.e. move by 1 char)
implemented in C or relying on pypy
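
The points above can be sketched in pure Python as follows. This is an illustration only, not a tuned implementation: the function name `rabin_karp_count` and the `base` and `mod` values are assumptions made for this example, and in CPython a loop like this will most likely be slower than the `str.find` approaches, which is why the answer suggests C or PyPy.

```python
def rabin_karp_count(text, sub, base=256, mod=1_000_000_007):
    """Count overlapping occurrences of sub in text with a rolling hash."""
    n, m = len(text), len(sub)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    sub_hash = window_hash = 0
    for i in range(m):
        sub_hash = (sub_hash * base + ord(sub[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod
    count = 0
    for start in range(n - m + 1):
        # Equal hashes may be a collision, so confirm with a real comparison.
        if window_hash == sub_hash and text[start:start + m] == sub:
            count += 1
        if start + m < n:
            # Slide the window: drop text[start], append text[start + m].
            window_hash = (window_hash - ord(text[start]) * high) % mod
            window_hash = (window_hash * base + ord(text[start + m])) % mod
    return count

print(rabin_karp_count('1011101111', '11'))  # 5
```

Each window update is O(1), so the scan is linear in the length of the text apart from the confirming comparisons on hash hits.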

answered Apr 10 '17 at 12:51

Dima Tisnek


score 1 · Answer 13 · answered Aug 04 '17 at 06:55


That can be solved using regex.

import re
def function(string, sub_string):
    # re.escape makes regex metacharacters in sub_string match literally
    match = re.findall('(?=' + re.escape(sub_string) + ')', string)
    return len(match)

answered Aug 04 '17 at 06:55

Himanshu Bhagwani


Anshul Tiwari · Answer 14 · 2017-09-20T20:41:47.057

1

def count_substring(string, sub_string):
    counter = 0
    for i in range(len(string)):
        if string[i:].startswith(sub_string):
            counter = counter + 1
    return counter

The above code simply loops through the string once and, at each position, checks whether the rest of the string starts with the substring being counted.

edited Sep 20 '17 at 20:41

answered Sep 19 '17 at 23:12

Anshul Tiwari


score 1 · Answer 15 · answered Sep 19 '21 at 14:00


re.subn hasn't been mentioned yet:

>>> import re
>>> re.subn('(?=11)', '', '1011101111')[1]
5

answered Sep 19 '21 at 14:00

Stefan Pochmann


xerxes · Answer 16 · 2022-07-02T20:11:13.303

1

Solution with replaced parts of the string

s = 'lolololol'
t = 0
t += s.count('lol')
s = s.replace('lol', 'lo1')
t += s.count('1ol')
print("Number of times lol occurs is:", t)

Answer is 4.

edited Jul 02 '22 at 20:11

answered Jul 02 '22 at 20:07

xerxes


score 0 · Answer 17 · answered Jun 03 '10 at 23:39

def count_overlaps (string, look_for):
    start   = 0
    matches = 0

    while True:
        start = string.find (look_for, start)
        if start < 0:
            break

        start   += 1
        matches += 1

    return matches

print(count_overlaps('abrabra', 'abra'))

score 0 · Answer 18 · answered Aug 01 '15 at 15:16

Function that takes as input two strings and counts how many times sub occurs in string, including overlaps. To check whether sub is a substring, I used the in operator.

def count_Occurrences(string, sub):
    count=0
    for i in range(0, len(string)-len(sub)+1):
        if sub in string[i:i+len(sub)]:
            count=count+1
    print('Number of times sub occurs in string (including overlaps):', count)

score 0 · Answer 19 · edited May 23 '17 at 11:47


For a duplicate question, I decided to go through the string three characters at a time and compare each piece, e.g.:

counted = 0

# slide a 3-character window one position at a time so overlaps are counted
for i in range(len(string) - 2):

    if string[i:i+3] == 'xox':
       counted = counted +1

print(counted)

edited May 23 '17 at 11:47

Community


answered Oct 03 '15 at 17:52

Shapi


score 0 · Answer 20 · answered May 07 '18 at 03:42

This is another example of using str.find() but a lot of the answers make it more complicated than necessary:

def occurrences(text, sub):
    c, n = 0, text.find(sub)
    while n != -1:
        c += 1
        n = text.find(sub, n+1)
    return c

In []:
occurrences('1011101111', '11')

Out[]:
5

pylang · Answer 21 · 2018-09-02T00:04:45.047

Given

sequence = '1011101111'
sub = "11"

Code

In this particular case:

sum(x == tuple(sub) for x in zip(sequence, sequence[1:]))
# 5

More generally, this

windows = zip(*([sequence[i:] for i, _ in enumerate(sequence)][:len(sub)]))
sum(x == tuple(sub) for x in windows)
# 5

or extend to generators:

import itertools as it


iter_ = (sequence[i:] for i, _ in enumerate(sequence))
windows = zip(*(it.islice(iter_, None, len(sub))))
sum(x == tuple(sub) for x in windows)

Alternative

You can use more_itertools.locate:

import more_itertools as mit


len(list(mit.locate(sequence, pred=lambda *args: args == tuple(sub), window_size=len(sub))))
# 5
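
If installing more_itertools is not an option, a similar sliding window can be sketched with the standard library alone. The helper below is hypothetical (the name `count_windows` is made up for this example). It keeps the current window in a `collections.deque` with `maxlen`, so it also works on iterators that cannot be sliced.

```python
from collections import deque
from itertools import islice

def count_windows(iterable, sub):
    """Count overlapping occurrences of the sequence sub in any iterable."""
    target = tuple(sub)
    it = iter(iterable)
    # Fill the first window; maxlen makes later appends drop the oldest item.
    window = deque(islice(it, len(target)), maxlen=len(target))
    if not target or len(window) < len(target):
        return 0
    count = int(tuple(window) == target)
    for item in it:
        window.append(item)
        count += tuple(window) == target
    return count

print(count_windows('1011101111', '11'))  # 5
```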

danbros · Answer 22 · 2019-06-30T11:50:55.263

A simple way to count substring occurrence is to use count():

>>> s = 'bobob'
>>> s.count('bob')
1

You can use replace() to find overlapping strings if you know which part will overlap:

>>> s = 'bobob'
>>> s.replace('b', 'bb').count('bob')
2

Note that besides being static, there are other limitations:

>>> s = 'aaa'
>>> s.count('aa') # there must be two occurrences
1
>>> s.replace('a', 'aa').count('aa')
3
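
In the 'aaa' example the trick reports 3, although only 2 overlapping occurrences exist. One way to check shortcuts like this is to compare them against a plain sliding-window count. The helper `count_overlapping` below is a reference sketch written for that comparison, not part of the answer above.

```python
def count_overlapping(s, sub):
    """Reference count: test every start position with str.startswith."""
    return sum(1 for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i))

# (string, substring, character doubled by the replace() trick)
for s, sub, dup in [('bobob', 'bob', 'b'), ('aaa', 'aa', 'a')]:
    trick = s.replace(dup, dup * 2).count(sub)
    print(s, sub, 'trick:', trick, 'reference:', count_overlapping(s, sub))
# bobob bob trick: 2 reference: 2
# aaa aa trick: 3 reference: 2
```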

score 0 · Answer 23 · edited Sep 22 '19 at 08:21


def occurance_of_pattern(text, pattern):
    text_len , pattern_len = len(text), len(pattern)
    return sum(1 for idx in range(text_len - pattern_len + 1) if text[idx: idx+pattern_len] == pattern)

edited Sep 22 '19 at 08:21

Trenton McKinney


answered Sep 21 '19 at 15:05

Rajan saha Raju


score 0 · Answer 24 · answered Jan 07 '22 at 20:22

I wanted to check whether the number of times the first character is repeated at the start of the input matches the number of times it is repeated at the end, e.g., pass on "foo" and """foo"" but fail on """bar"":

from collections import deque
from functools import partial
from itertools import count, takewhile
from operator import eq


# From https://stackoverflow.com/a/15112059
def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.

    :param iterable: An iterable
    :type iterable: ```Iterable```

    :return: Number of items in iterable
    :rtype: ```int```
    """
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)


def begin_matches_end(s):
    """
    Checks if the begin matches the end of the string

    :param s: Input string of length > 0
    :type s: ```str```

    :return: Whether the beginning matches the end (checks first match chars)
    :rtype: ```bool```
    """
    return (count_iter_items(takewhile(partial(eq, s[0]), s)) ==
            count_iter_items(takewhile(partial(eq, s[0]), s[::-1])))

score -2 · Answer 25 · edited Sep 22 '12 at 04:11


If you want to count permutation counts of length 5 (adjust if wanted for different lengths):

def MerCount(s):
  for i in xrange(len(s)-4):
    d[s[i:i+5]] += 1
return d

edited Sep 22 '12 at 04:11

Himanshu


answered Sep 20 '12 at 08:19

GrimSqueaker


'count permutation counts' does not make much sense to me. `d` is not a defined name. If the code did run, it would not answer the question. – Terry Jan Reedy Mar 07 '15 at 20:43
