
I know this topic has already been discussed multiple times here on StackOverflow, but I'm looking for a better answer.

While I appreciate the differences, I was not able to find a definitive explanation of why the re module in Python provides both match() and search(). Couldn't I get the same behavior with search() if I prepend ^ in single-line mode and \A in multiline mode? Am I missing anything?

I tried to understand the implementation by looking at the _sre.c code, and as far as I can tell the search (sre_search()) is implemented by moving a pointer through the string to be searched and applying sre_match() at each position until a match is found.
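
To make that reading of _sre.c concrete, here is a rough pure-Python sketch of the idea: try a match at each successive offset until one succeeds. The helper name `naive_search` is hypothetical, and the real C code is certainly more elaborate than this, so treat it as an illustration only.

```python
import re

def naive_search(pattern, string):
    """Hypothetical helper: emulate search() by retrying match() at each offset."""
    compiled = re.compile(pattern)
    # Try every start position, including len(string), so that a pattern
    # able to match the empty string can still succeed at the very end.
    for pos in range(len(string) + 1):
        m = compiled.match(string, pos)
        if m is not None:
            return m
    return None

text = "first line\nsecond line"
found = naive_search(r"second", text)
print(found.start() if found else None)  # 11
print(naive_search(r"third", text))      # None
```

On this input the sketch returns the same span as `re.search`, but it says nothing about how the two compare in speed.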

So I guess that re.match() might be slightly faster than re.search() with the corresponding anchored regular expression (using ^ or \A). Is that the reason?

I also searched the python-dev mailing list archives, but to no avail.

>>> string="""first line
... second line"""
>>> print re.match('first', string, re.MULTILINE)
<_sre.SRE_Match object at 0x1072ae7e8>
>>> print re.match('second', string, re.MULTILINE)
None
>>> print re.search(r'\Afirst', string, re.MULTILINE)
<_sre.SRE_Match object at 0x1072ae7e8>
>>> print re.search(r'\Asecond', string, re.MULTILINE)
None
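
For contrast, a small sketch in Python 3 syntax of how the two anchors differ under re.MULTILINE: `^` also matches right after a newline, while `\A` only matches at the very start of the string, which is why only `\A` reproduces what re.match() does here. The variable names are illustrative.

```python
import re

text = "first line\nsecond line"

# re.match only tries position 0, regardless of re.MULTILINE.
match_result = re.match(r"second", text, re.MULTILINE)

# In MULTILINE mode, ^ also matches just after each newline.
caret_result = re.search(r"^second", text, re.MULTILINE)

# \A matches only at the very start of the string, in any mode.
anchor_result = re.search(r"\Asecond", text, re.MULTILINE)

print(match_result)          # None
print(caret_result.start())  # 11
print(anchor_result)         # None
```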
spider
  • A quick try actually suggests that `match` is slower: http://pastebin.com/VABXxY3H – Evpok Mar 12 '15 at 10:34
  • @vks no, `re.match(r'[\s\S]*', "foo\nbar").group()` – Avinash Raj Mar 12 '15 at 10:47
  • @AvinashRaj `Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.` https://docs.python.org/2/library/re.html – vks Mar 12 '15 at 10:49
  • Yep, it starts matching the string from the start. – Avinash Raj Mar 12 '15 at 11:05
  • @vks yes but `re.match('abc', string, re.MULTILINE)` is equivalent to `re.search('\Aabc', string, re.MULTILINE)` – spider Mar 12 '15 at 11:24
  • @spider That's the point: if you don't want the behaviour that `re.match` has, you need `re.search`, right? – vks Mar 12 '15 at 11:25
  • It's mainly convenience I suppose. While you can use `re.search`, generally `re.match` is what a lot of people need. – Wolph Mar 12 '15 at 11:54
  • Also, don't underestimate the improvement in readability... Ruby, for example, has even introduced the keyword `unless`, which is basically `if not`, only to make the code easier to understand. – swenzel Mar 12 '15 at 16:28
  • Might be a bit polemic but if you go crazy about it you could also ask yourself what the `+` is good for if you can have the same functionality with `--` – swenzel Mar 12 '15 at 16:34

1 Answer


As you already know, re.match tests the pattern only at the start of the string, while re.search tries successive positions in the string until it finds a match.

So, is there a difference between re.match('toto', s) and re.search('^toto', s), and if so, what is it?

Let's run a small test:

#!/usr/bin/python

import time
import re

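# Same literal: unanchored for match(), explicitly anchored for search().
p1 = re.compile(r'toto')
p2 = re.compile(r'^toto')

ssize = 1000

# s1 starts with 'toto' (the pattern succeeds); s2 does not (the pattern fails).
s1 = 'toto abcdefghijklmnopqrstuvwxyz012356789'*ssize
s2 = 'titi abcdefghijklmnopqrstuvwxyz012356789'*ssize

# Number of repetitions per measurement.
nb = 1000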

i = 0
t0 = time.time()
while i < nb:
    p1.match(s1)
    i += 1
t1 = time.time()

i = 0
t2 = time.time()
while i < nb:
    p2.search(s1)
    i += 1
t3 = time.time()

# print() with a single argument behaves the same on Python 2 and Python 3.
print("\nsucceed\nmatch:")
print(t1 - t0)
print("search:")
print(t3 - t2)

# Same measurement on s2, where the pattern fails.
i = 0
t0 = time.time()
while i < nb:
    p1.match(s2)
    i += 1
t1 = time.time()

i = 0
t2 = time.time()
while i < nb:
    p2.search(s2)
    i += 1
t3 = time.time()

print("\nfail\nmatch:")
print(t1 - t0)
print("search:")
print(t3 - t2)

Both approaches are tested against a string that matches and a string that doesn't.

Results (times in seconds):

succeed
match:
0.000469207763672
search:
0.000494003295898

fail
match:
0.000430107116699
search:
0.46605682373

What can we conclude from these results?

1) Performance is similar when the pattern succeeds.

2) Performance is completely different when the pattern fails. This is the most important point, because it means that re.search keeps testing each position of the string even though the pattern is anchored, whereas re.match stops immediately.

If you increase the size of the failing test string, you will see that re.match doesn't take any longer, while the time taken by re.search depends on the string size.
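
One way to check that scaling claim is to time the failing case for a few string sizes. This is only a sketch, assuming `timeit` from the standard library; the helper name `time_failing_case` and the chosen sizes are illustrative, and the absolute numbers (and possibly the size of the gap) will vary with the Python version and the machine.

```python
import re
import timeit

p_match = re.compile(r"toto")    # used with match(): implicitly tried at position 0 only
p_search = re.compile(r"^toto")  # used with search(): explicitly anchored

def time_failing_case(size, repeats=100):
    """Hypothetical helper: time both calls on a string that cannot match."""
    s = "titi abcdefghijklmnopqrstuvwxyz012356789" * size
    t_match = timeit.timeit(lambda: p_match.match(s), number=repeats)
    t_search = timeit.timeit(lambda: p_search.search(s), number=repeats)
    return t_match, t_search

# Grow the failing string by a factor of ten each time and compare the timings.
for size in (100, 1000, 10000):
    t_match, t_search = time_failing_case(size)
    print("size=%6d  match: %.6f s  search: %.6f s" % (size, t_match, t_search))
```

If the search column grows roughly in step with the size while the match column stays flat, that is the linear-versus-constant behaviour described above.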

Casimir et Hippolyte
  • That's interesting. I tried with both 2.7 and 3.4 and the behaviour is the one you described. Now I'm also more curious, I'll dig a bit more into it, i.e. better looking at the implementation. – spider Mar 13 '15 at 10:07
  • @spider: I'm not sure that this will reveal much more, especially since there is clearly a linear relation between the time and the failing string size with the `re.search` method, whereas the time is constant with the `re.match` method. – Casimir et Hippolyte Mar 13 '15 at 15:07
  • Casimir, I mean that I'm curious to understand the differences between the two implementations, for the sake of improving my knowledge. I accepted your answer, btw. – spider Mar 13 '15 at 16:39
  • This is not a convincing answer imo, because `re.search` could be optimized not to continue needless searching when there's a ^ in the pattern. The fact that it hasn't been given such an optimization seems to be just an implementation detail; that's not a particularly good reason for `re.match` to exist. – wim Sep 10 '20 at 15:57