5

I wanted to count the number of times that a string like 'aa' appears in 'aaa' (or 'aaaa').

The most obvious code gives the wrong (or at least, not the intuitive) answer:

'aaa'.count('aa')
1 # should be 2
'aaaa'.count('aa')
2 # should be 3

Does anyone have a simple way to fix this?

nivk
  • 685
  • 1
  • 6
  • 21

4 Answers4

10

From str.count() documentation:

Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.

So, no. You are getting the expected result.

If you want to count number of overlapping matches, use regex:

>>> import re
>>> 
>>> len(re.findall(r'(a)(?=\1)', 'aaa'))
2

This finds all the occurrence of a, which is followed by a. The 2nd a wouldn't be captured, as we've used look-ahead, which is zero-width assertion.

Rohit Jain
  • 209,639
  • 45
  • 409
  • 525
  • Well, yes. But the question is how to calculate it instead? – Simeon Visser Oct 10 '13 at 17:40
  • 1
    I mean, you're 100% correct. But that's not helpful :/ – nivk Oct 10 '13 at 17:40
  • 1
    You can instead change your OP to 'how to count overlapping occurrences of a substring', to emphasize that you want to know HOW, rather than why it is giving an answer that is not intuitive to you. – hexparrot Oct 10 '13 at 17:42
6
haystack = "aaaa"
needle   = "aa"

matches  = sum(haystack[i:i+len(needle)] == needle 
               for i in xrange(len(haystack)-len(needle)+1))

# for Python 3 use range instead of xrange
kindall
  • 178,883
  • 35
  • 278
  • 309
1

The solution is not taking overlap into consideration.

Try this:

big_string = "aaaa"
substring = "aaa"
count = 0 

for char in range(len(big_string)):
    count += big_string[char: char + len(subtring)] == substring

print count
Lucas Ribeiro
  • 6,132
  • 2
  • 25
  • 28
0

You have to be careful, because you seem to looking for non-overlapping substrings. To fix this I'd do:

len([s.start() for s in re.finditer('(?=aa)', 'aaa')])

And if you don't care about the position where the substring starts you can do:

len([_ for s in re.finditer('(?=aa)', 'aaa')])

Although someone smarter than myself might be able to show that there are performances differences :)

mlnyc
  • 2,636
  • 2
  • 24
  • 29