12

I was under the impression that startswith has to be faster than in for the simple reason that in has to do more checks (allows for the word being looked for to be anywhere in the string). But I had my doubts so I decided to timeit. The code for the timings is given below and as you will probably notice I haven't done much timing; the code is rather simple.

import timeit

setup1='''
def in_test(sent, word):
    if word in sent:
        return True
    else:
        return False
'''

setup2='''
def startswith_test(sent, word):
    if sent.startswith(word):
        return True
    else:
        return False
'''

print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))

Results:

>> in:         0.11912814951705597
>> startswith: 0.22812353561129417

So startswith is twice as slow!.. I find this behavior very puzzling given what I said further above. Am I doing something wrong with timing the two or is in indeed faster? If so, why?

Note that the results are very similar even when they both return False (in this case, in would have to actually traverse the whole sentece in case it simply short-circuited before):

print(timeit.timeit('in_test("another standard sentence, would be that", "this")', setup=setup1))
print(timeit.timeit('startswith_test("another standard sentence, would be that", "this")', setup=setup2))

>> in:         0.12854891578786237
>> startswith: 0.2233201940338861

If I had to implement the two functions from scratch it would look something like this (pseudocode):

startswith: start comparing the letters of word to the letters of sentence one by one until a) word gets depleted (return True) or b) check returns False (return False)

in: call startswith for every position where the initial letter of word can be found in sentence.

I just don't get it..


Just to make it clear, in and startswith are not equivallent; I am just talking about the case where the word one is trying to find has to be the first in a string.

Ma0
  • 15,057
  • 4
  • 35
  • 65

3 Answers3

11

This is due to the fact that you have to look-up and invoke a method. in is specialized and leads directly to COMPARE_OP (calling cmp_outcome which, in turn, calls PySequence_Contains) while str.startswith goes through slower byte-code:

2 LOAD_ATTR                0 (startswith)
4 LOAD_FAST                1 (word)
6 CALL_FUNCTION            1              # the slow part

Replacing in with __contains__, forcing a function call for that case too, pretty much negates the speed difference:

setup1='''
def in_test(sent, word):
    if sent.__contains__(word):
        return True
    else:
        return False
'''

And, the timings:

print(timeit.timeit('in_test("this is a standard sentence", "this")', setup=setup1))
print(timeit.timeit('startswith_test("this is a standard sentence", "this")', setup=setup2))
0.43849368393421173
0.4993997460696846

in is winning here because of the fact that it doesn't need to go through the whole function call setup and due to the favorable case it's presented with.

Dimitris Fasarakis Hilliard
  • 150,925
  • 31
  • 268
  • 253
  • The favorable case does not seem to have such an influence; the pessimistic case yielded the same results. More like the length of the string and the overhead of `startswith` – Ma0 Jun 13 '17 at 11:49
  • I should be more verbose @Ev.Kounis :-) The length is what I meant as the favorable case. The fact that there's a match or not in a small string won't play much difference. – Dimitris Fasarakis Hilliard Jun 13 '17 at 11:53
  • 1
    now we are back on the same page xD – Ma0 Jun 13 '17 at 11:54
  • 1
    cmp_outcome link https://github.com/python/cpython/blob/4a8bcdf79cdb3684743fe1268de62ee88bada439/Python/ceval.c#L4933 – Roney Island Jul 28 '21 at 18:24
6

You're comparing an operator on strings -vs- an attribute lookup and a function call. The second one will have a higher overhead, even if the first one takes a long time on a lot of data.

Additionally you're looking for the first word, so if it does match, in will look at just as much data as startswith(). To see the difference you should look at a pessimistic case (no results found, or match at the end of the string):

setup1='''
data = "xxxx"*1000
def ....

print(timeit.timeit('in_test(data, "this")', setup=setup1))
0.932795189000899
print(timeit.timeit('startswith_test(data, "this")', setup=setup2))
0.22242475600069156
viraptor
  • 33,322
  • 10
  • 107
  • 191
  • Thanks for your answer. The pessimistic case is already on the editted post. – Ma0 Jun 13 '17 at 11:40
  • 1
    Your pessimistic case is very short. Looking at 4 characters -vs- 40 is just a rounding error. Looking at 4 -vs- 4000 is where you start seeing the difference. – viraptor Jun 13 '17 at 11:42
  • I understand but I would like to use either one of those to check small sentences like that and not entire texts. But it does make more sense now. – Ma0 Jun 13 '17 at 11:43
  • 2
    So we could argue that `in` is O(n) while `startswith` is O(c), but for small `n`s O(n) is better, right? – Ma0 Jun 13 '17 at 11:53
  • Exactly :) That's it – viraptor Jun 13 '17 at 12:00
0

If you look at bytecode produced by your functions:

>>> dis.dis(in_test)
  2           0 LOAD_FAST                1 (word)
              3 LOAD_FAST                0 (sent)
              6 COMPARE_OP               6 (in)
              9 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               1 (True)
             15 RETURN_VALUE

  5     >>   16 LOAD_CONST               2 (False)
             19 RETURN_VALUE
             20 LOAD_CONST               0 (None)
             23 RETURN_VALUE

you'll notice there is much overhead not directly related to string matching. Doing the test on a simpler function:

def in_test(sent, word):
    return word in sent

will be more reliable.

Błotosmętek
  • 12,717
  • 19
  • 29