18

Possible Duplicate:
String.replaceAll() anomaly with greedy quantifiers in regex
Strange behavior in regexes

While

"a".replaceAll("a", "b")
"a".replaceAll("a+", "b")
"a".replaceAll("a+?", "b")

all return b, why does

"a".replaceAll("a*", "b")

return bb and

"a".replaceAll("a*?", "b")

return bab?

Community
  • 1
  • 1
sp00m
  • 47,968
  • 31
  • 142
  • 252

3 Answers3

22
"a".replaceAll("a*", "b")

First replaces a to b, then advances the pointer past the b. Then it matches the end of string, and replaces with b. Since it matched an empty string, it advances the pointer, falls out of the string, and finishes, resulting in bb.

"a".replaceAll("a*?", "b")

first matches the start of string and replaces with b. It doesn't match the a because ? in a*? means "non-greedy" (match as little as possible). Since it matched an empty string, it advances the pointer, skipping a. Then it matches the end of string, replaces with b and falls out of the string, resulting in bab. The end result is the same as if you did "a".replaceAll("", "b").

John Dvorak
  • 26,799
  • 13
  • 69
  • 83
  • @mikeslattery If the pointer didn't advance, then the replacer would be stuck in an infinite loop anytime you matched an empty string. – John Dvorak Feb 04 '13 at 17:53
  • That prints bbb. It seems more logical to me: echo a | perl -i -pe "s/a*?/b/g" – jdb Feb 04 '13 at 18:16
  • @jdb what are the rules in perl? If a match is empty, the pointer doesn't move but the next match is restricted to be non-empty with the usual rules of backtracking the last group first? – John Dvorak Feb 04 '13 at 18:20
  • Advancing to the next character because the empty space was a match explains the problem but does not make it right. I don't know what the Perl rules are but it just make more sense. Python has the same bug. – jdb Feb 04 '13 at 18:46
  • @jdb I wouldn't call it a bug. It does make sense and it's definitely intentional. Javascript behaves as Java (skips a character). – John Dvorak Feb 04 '13 at 19:25
  • Thank you for your explanation. I agree with @jbd, it would make more sense to me if `a*?` on `a` returned `bbb`. – sp00m Feb 04 '13 at 20:48
5

This happens because of zero-width matches.


"a".replaceAll("a*", "b")

Will match two times:

  1. Try match at beginning of the string, greedy * consumes the a as a match.
  2. Advance to the next position in the string (now at end of string), try match there, empty string matches.

    " a "
     \| \___ 2. match empty string
      \_____ 1. match "a"
    


"a".replaceAll("a*?", "b")

Will also match two times:

  1. Try match at beginning of the string, non-greedy *? matches the empty string without consuming the a.
  2. Advance to next position in the string (now at end of string), try match there, empty string matches.

    " a "
     \  \___ 2. match empty string
      \_____ 1. match empty string
    
Qtax
  • 33,241
  • 9
  • 83
  • 121
1

in "a".replaceAll("a*", "b")

* - for 0 or more so here in `a*`
1. a - replaced by b
2. * - is also true as empty string is true for *,so, replaced by b 

and in "a".replaceAll("a*?", "b")

1. *? - makes anything non-greedy means it makes regrex to match as little
        as possible,
2. so,the pre and post empty string would be replaced by "b" and 
   as a*? is non-greedy, it will not replace it 
sourcecode
  • 1,802
  • 2
  • 15
  • 17