69

What's the best way to count the number of occurrences of a given string, including overlap in Python? This is one way:

def function(string, str_to_search_for):
      count = 0
      for x in xrange(len(string) - len(str_to_search_for) + 1):
           if string[x:x+len(str_to_search_for)] == str_to_search_for:
                count += 1
      return count


function('1011101111','11')

This method returns 5.

Is there a better way in Python?

charlesreid1
  • 4,360
  • 4
  • 30
  • 52
calccrypto
  • 8,583
  • 21
  • 68
  • 99

25 Answers25

94

Well, this might be faster since it does the comparing in C:

def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count+=1
        else:
            return count
Jochen Ritzel
  • 104,512
  • 31
  • 200
  • 194
55
>>> import re
>>> text = '1011101111'
>>> len(re.findall('(?=11)', text))
5

If you didn't want to load the whole list of matches into memory, which would never be a problem! you could do this if you really wanted:

>>> sum(1 for _ in re.finditer('(?=11)', text))
5

As a function (re.escape makes sure the substring doesn't interfere with the regex):

def occurrences(text, sub):
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))
>>> occurrences(text, '11')
5
wjandrea
  • 28,235
  • 9
  • 60
  • 81
jamylak
  • 128,818
  • 30
  • 231
  • 230
  • Could you clarify why it would never be a problem? In practice, if there were a lot of matches or the matched substrings were very large, it could cause a lot of memory usage, no? – wjandrea Nov 20 '22 at 19:00
  • 1
    @wjandrea I think maybe I should have said "which would probably never be a problem" because I provided both the `iter` and non `iter` solution. But in this case the text is already in memory so I thought that getting the list would be fine – jamylak Nov 21 '22 at 00:05
22

You can also try using the new Python regex module, which supports overlapping matches.

import regex as re

def count_overlapping(text, search_for):
    return len(re.findall(search_for, text, overlapped=True))

count_overlapping('1011101111','11')  # 5
David C
  • 7,204
  • 5
  • 46
  • 65
  • `import regex as re`? Isn't that confusing? Why not just `import regex`? – wjandrea Nov 20 '22 at 19:01
  • 1
    There's nothing wrong with `import regex` -- documentation shows that approach. `regex`, however, has all the same components as the standard library `re`, so I prefer writing `re.compile`, etc. in way that is familiar and concise. Also, most of the time I end up using `regex` I started with `re` and then found some use case I want to rely on `regex`. I can then update `import re` to `import regex as re` at the top of the file and not have to make other changes. – David C Nov 25 '22 at 14:31
14

Python's str.count counts non-overlapping substrings:

In [3]: "ababa".count("aba")
Out[3]: 1

Here are a few ways to count overlapping sequences, I'm sure there are many more :)

Look-ahead regular expressions

How to find overlapping matches with a regexp?

In [10]: re.findall("a(?=ba)", "ababa")
Out[10]: ['a', 'a']

Generate all substrings

In [11]: data = "ababa"
In [17]: sum(1 for i in range(len(data)) if data.startswith("aba", i))
Out[17]: 2
Community
  • 1
  • 1
Dima Tisnek
  • 11,241
  • 4
  • 68
  • 120
4
def count_substring(string, sub_string):
    count = 0
    for pos in range(len(string)):
        if string[pos:].startswith(sub_string):
            count += 1
    return count

This could be the easiest way.

Graham
  • 3,153
  • 3
  • 16
  • 31
Arun Tom
  • 789
  • 6
  • 8
4

A fairly pythonic way would be to use list comprehension here, although it probably wouldn't be the most efficient.

sequence = 'abaaadcaaaa'
substr = 'aa'

counts = sum([
    sequence.startswith(substr, i) for i in range(len(sequence))
])
print(counts)  # 5

The list would be [False, False, True, False, False, False, True, True, False, False] as it checks all indexes through the string, and because int(True) == 1, sum gives us the total number of matches.

not 0x12
  • 19,360
  • 22
  • 67
  • 133
ParkerD
  • 1,214
  • 11
  • 18
3
s = "bobobob"
sub = "bob"
ln = len(sub)
print(sum(sub == s[i:i+ln] for i in xrange(len(s)-(ln-1))))
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
3

How to find a pattern in another string with overlapping

This function (another solution!) receive a pattern and a text. Returns a list with all the substring located in the and their positions.

def occurrences(pattern, text):
    """
    input: search a pattern (regular expression) in a text
    returns: a list of substrings and their positions 
    """
    p = re.compile('(?=({0}))'.format(pattern))
    matches = re.finditer(p, text)
    return [(match.group(1), match.start()) for match in matches]

print (occurrences('ana', 'banana'))
print (occurrences('.ana', 'Banana-fana fo-fana'))

[('ana', 1), ('ana', 3)]
[('Bana', 0), ('nana', 2), ('fana', 7), ('fana', 15)]

Community
  • 1
  • 1
Jose Raul Barreras
  • 849
  • 1
  • 13
  • 19
2

My answer, to the bob question on the course:

s = 'azcbobobegghaklbob'
total = 0
for i in range(len(s)-2):
    if s[i:i+3] == 'bob':
        total += 1
print 'number of times bob occurs is: ', total
Luke D
  • 2,013
  • 3
  • 16
  • 16
1

Here is my edX MIT "find bob"* solution (*find number of "bob" occurences in a string named s), which basicaly counts overlapping occurrences of a given substing:

s = 'azcbobobegghakl'
count = 0

while 'bob' in s:
    count += 1 
    s = s[(s.find('bob') + 2):]

print "Number of times bob occurs is: {}".format(count)
dasdachs
  • 709
  • 10
  • 18
1

An alternative very close to the accepted answer but using while as the if test instead of including if inside the loop:

def countSubstr(string, sub):
    count = 0
    while sub in string:
        count += 1
        string = string[string.find(sub) + 1:]
    return count;

This avoids while True: and is a little cleaner in my opinion

stevoblevo
  • 78
  • 7
1

If strings are large, you want to use Rabin-Karp, in summary:

  • a rolling window of substring size, moving over a string
  • a hash with O(1) overhead for adding and removing (i.e. move by 1 char)
  • implemented in C or relying on pypy
Dima Tisnek
  • 11,241
  • 4
  • 68
  • 120
1

That can be solved using regex.

import re
def function(string, sub_string):
    match = re.findall('(?='+sub_string+')',string)
    return len(match)
1
def count_substring(string, sub_string):
    counter = 0
    for i in range(len(string)):
        if string[i:].startswith(sub_string):
        counter = counter + 1
    return counter

Above code simply loops throughout the string once and keeps checking if any string is starting with the particular substring that is being counted.

Anshul Tiwari
  • 361
  • 2
  • 12
1

re.subn hasn't been mentioned yet:

>>> import re
>>> re.subn('(?=11)', '', '1011101111')[1]
5
Stefan Pochmann
  • 27,593
  • 8
  • 44
  • 107
1

Solution with replaced parts of the string

s = 'lolololol'
t = 0
t += s.count('lol')
s = s.replace('lol', 'lo1')
t += s.count('1ol')
print("Number of times lol occurs is:", t)

Answer is 4.

xerxes
  • 11
  • 2
0
def count_overlaps (string, look_for):
    start   = 0
    matches = 0

    while True:
        start = string.find (look_for, start)
        if start < 0:
            break

        start   += 1
        matches += 1

    return matches

print count_overlaps ('abrabra', 'abra')
0

Function that takes as input two strings and counts how many times sub occurs in string, including overlaps. To check whether sub is a substring, I used the in operator.

def count_Occurrences(string, sub):
    count=0
    for i in range(0, len(string)-len(sub)+1):
        if sub in string[i:i+len(sub)]:
            count=count+1
    print 'Number of times sub occurs in string (including overlaps): ', count
0

For a duplicated question i've decided to count it 3 by 3 and comparing the string e.g.

counted = 0

for i in range(len(string)):

    if string[i*3:(i+1)*3] == 'xox':
       counted = counted +1

print counted
Community
  • 1
  • 1
Shapi
  • 5,493
  • 4
  • 28
  • 39
0

This is another example of using str.find() but a lot of the answers make it more complicated than necessary:

def occurrences(text, sub):
    c, n = 0, text.find(sub)
    while n != -1:
        c += 1
        n = text.find(sub, n+1)
    return c

In []:
occurrences('1011101111', '11')

Out[]:
5
AChampion
  • 29,683
  • 4
  • 59
  • 75
0

Given

sequence = '1011101111'
sub = "11"

Code

In this particular case:

sum(x == tuple(sub) for x in zip(sequence, sequence[1:]))
# 5

More generally, this

windows = zip(*([sequence[i:] for i, _ in enumerate(sequence)][:len(sub)]))
sum(x == tuple(sub) for x in windows)
# 5

or extend to generators:

import itertools as it


iter_ = (sequence[i:] for i, _ in enumerate(sequence))
windows = zip(*(it.islice(iter_, None, len(sub))))
sum(x == tuple(sub) for x in windows)

Alternative

You can use more_itertools.locate:

import more_itertools as mit


len(list(mit.locate(sequence, pred=lambda *args: args == tuple(sub), window_size=len(sub))))
# 5
pylang
  • 40,867
  • 14
  • 129
  • 121
0

A simple way to count substring occurrence is to use count():

>>> s = 'bobob'
>>> s.count('bob')
1

You can use replace () to find overlapping strings if you know which part will be overlap:

>>> s = 'bobob'
>>> s.replace('b', 'bb').count('bob')
2

Note that besides being static, there are other limitations:

>>> s = 'aaa'
>>> count('aa') # there must be two occurrences
1 
>>> s.replace('a', 'aa').count('aa')
3
danbros
  • 181
  • 1
  • 3
0
def occurance_of_pattern(text, pattern):
    text_len , pattern_len = len(text), len(pattern)
    return sum(1 for idx in range(text_len - pattern_len + 1) if text[idx: idx+pattern_len] == pattern) 
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Rajan saha Raju
  • 794
  • 7
  • 13
0

I wanted to see if the number of input of same prefix char is same postfix, e.g., "foo" and """foo"" but fail on """bar"":

from itertools import count, takewhile
from operator import eq


# From https://stackoverflow.com/a/15112059
def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.

    :param iterable: An iterable
    :type iterable: ```Iterable```

    :return: Number of items in iterable
    :rtype: ```int```
    """
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)


def begin_matches_end(s):
    """
    Checks if the begin matches the end of the string

    :param s: Input string of length > 0
    :type s: ```str```

    :return: Whether the beginning matches the end (checks first match chars
    :rtype: ```bool```
    """
    return (count_iter_items(takewhile(partial(eq, s[0]), s)) ==
            count_iter_items(takewhile(partial(eq, s[0]), s[::-1])))
A T
  • 13,008
  • 21
  • 97
  • 158
-2

If you want to count permutation counts of length 5 (adjust if wanted for different lengths):

def MerCount(s):
  for i in xrange(len(s)-4):
    d[s[i:i+5]] += 1
return d
Himanshu
  • 31,810
  • 31
  • 111
  • 133
GrimSqueaker
  • 412
  • 5
  • 17
  • 'count permutation counts' does not make much sense to me. `d` is not a defined name. If the code did run, it would not answer the question. – Terry Jan Reedy Mar 07 '15 at 20:43