0

I do not understand how regex string matching works

r2 = r'a[bcd]*b'
m1 = re.findall(r2,"abcbd")
abcb

This falls in line with what was explained in regex

Step 3 The engine tries to match b, but the current position is at the end of the string, so it fails.

How?I do not understand this?

fabiano.mota
  • 187
  • 1
  • 5
  • 15
  • 1
    This is called *backtracking*. – Wiktor Stribiżew Jul 28 '17 at 08:09
  • Possibly related: https://stackoverflow.com/questions/9011592/in-regular-expressions-what-is-a-backtracking-back-referencing But I really don't get your question :( – Sebastian Proske Jul 28 '17 at 08:13
  • Look and play with your regex: https://regex101.com/r/rAJLc0/5. You have error in "step 3" - The engine includes in match last b in abcbd, and stops. It is not failure, you have found a substring that matches. – Gangnus Jul 28 '17 at 08:18
  • See also [this post of mine](http://stackoverflow.com/questions/33869557/can-i-improve-performance-of-this-regular-expression-further/33869801#33869801), it might help you understand how regexes with quantifiers work. – Wiktor Stribiżew Jul 28 '17 at 08:40

2 Answers2

4

The following regex a[bcd]*b matches the longest substring (because * is greedy):

  • a starting with a
  • [bcd]* followed by any number (0: can match empty string) of character in set (b,c,d)
  • b ending by b

EDIT: following comment, backtracking occurs in following example

>>> re.findall(r2,"abcxb")
['ab']
  • abc matches a[bcd]*, but x is not expected
  • a also matches a[bcd]* (because empty string matches [bcd]*)
  • finally returns ab

Concerning greediness, the metacharacter * after a single character, a character set or a group, means any number of times (the most possible match) some regexp engines accept the sequence of metacharacters *? which modifies the behavior to the least possible, for example:

>>> r2 = r'a[bcd]*?b'
>>> re.findall(r2,"abcbde")
['ab']
Nahuel Fouilleul
  • 18,726
  • 2
  • 31
  • 36
2

Your regular expression requires the match to end in b, therefore everything is matched up to the trailing d. If b were optional, as in a[bcd]*b?, then entire string would be matched.

Vadim Landa
  • 2,784
  • 5
  • 23
  • 33