4

There was a question about regex, and while trying to answer it I found another strange thing.

String x = "X";
System.out.println(x.replaceAll("X*", "Y"));

This prints YY. Why?

String x = "X";
System.out.println(x.replaceAll("X*?", "Y"));

And this prints YXY.

Why doesn't the reluctant regex match the 'X' character? The string is effectively "nothing", then X, then "nothing". So why doesn't the first regex match all three pieces at once, instead of matching two and then one? And why does the second regex match only the "nothing"s and not the X?

shift66

3 Answers

8

Let's consider them in turn:

"X".replaceAll("X*", "Y")

There are two matches:

  1. At character position 0, X is matched, and is replaced with Y.
  2. At character position 1, the empty string is matched, and Y gets added to the output.

End result: YY.

"X".replaceAll("X*?", "Y")

There are also two matches:

  1. At character position 0, the empty string is matched, and Y gets added to the output. The character at this position, X, was not consumed by the match, and is therefore copied into the output verbatim.
  2. At character position 1, the empty string is matched, and Y gets added to the output.

End result: YXY.
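To make the two walkthroughs above visible, here is a minimal sketch that lists each match as a `[start,end)` span using `Matcher.find()`. The class name `MatchTrace` and the helper `spans` are made up for illustration; only the `java.util.regex` calls are standard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTrace {
    // Hypothetical helper: collects every match as "[start,end)".
    // A zero-length match shows up as a span whose start equals its end.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("[" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy: consumes X at position 0, then matches empty at position 1.
        System.out.println(spans("X*", "X"));   // [[0,1), [1,1)]
        // Reluctant: matches empty at position 0, then empty at position 1.
        System.out.println(spans("X*?", "X"));  // [[0,0), [1,1)]
    }
}
```

Each span is replaced by one Y, and any character not covered by a span is copied through unchanged, which gives YY and YXY respectively.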

NPE
  • In the first case, in the second step (2. At character position 1...), there is no position 1; it's out of the string's bounds, isn't it? After the first step everything should be over, because the string has ended. – shift66 Feb 10 '12 at 13:38
  • @Ademiban: Not quite. There is a position `1`. Consider the following regex: `"$"`. By definition, the *only* place where it can match is after the last character of the string. In this example, that would be at position `1`. The same thing happens to regexes that can produce zero-length matches. – NPE Feb 10 '12 at 13:39
  • Great answer! Let me add a possibly interesting note though ;) In the second scenario, the X is not matched since *? implies a lazy match, i.e. the elements before the *? are preferably not matched if that still yields a valid result. – Willem Mulder Feb 10 '12 at 13:44
  • @aix, you mean there's a kind of end-of-line symbol at position 1? – shift66 Feb 10 '12 at 13:48
  • @Ademiban: Pretty much. In other words, a regex that is entirely optional can produce a match right *after* the final character of the string. – NPE Feb 10 '12 at 13:49
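The point about `"$"` in the comments above can be checked directly. This is only a small sketch; the class `EndAnchorDemo` and the helper `firstMatchStart` are hypothetical names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndAnchorDemo {
    // Hypothetical helper: start index of the first match, or -1 if none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // "$" can only match after the last character, i.e. at index 1 of "X".
        System.out.println(firstMatchStart("$", "X"));  // 1
        // The zero-length match at index 1 is replaced, X is copied through.
        System.out.println("X".replaceAll("$", "Y"));   // XY
    }
}
```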
1

The * is a tricky 'quantifier' since it means '0 or more'. Thus, it also matches '0 times X' (i.e. an empty string).

I would use

"X".replaceAll("X+", "Y")

which has the expected behaviour.
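As a rough side-by-side of the two quantifiers, the sketch below runs both on a slightly longer input. The class `PlusVsStar` and its two helper methods are invented for this example.

```java
class PlusVsStar {
    // Hypothetical helpers wrapping the two replaceAll calls being compared.
    static String star(String s) { return s.replaceAll("X*", "Y"); }
    static String plus(String s) { return s.replaceAll("X+", "Y"); }

    public static void main(String[] args) {
        System.out.println(star("X"));     // YY
        System.out.println(plus("X"));     // Y
        // X* also matches the empty string before 'a', before 'b', and at the end.
        System.out.println(star("aXXb"));  // YaYYbY
        // X+ needs at least one X, so only the run "XX" is replaced.
        System.out.println(plus("aXXb"));  // aYb
    }
}
```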

Willem Mulder
0

In your first example you are using a "greedy" quantifier. A greedy quantifier consumes as much of the input as it can before attempting the first match, so the first match tried is the whole input. Since that attempt succeeds, the matcher moves past the input and then finds the zero-length match at the end of the string, hence the two matches you see. The greedy matcher would back off to the zero-length match before the X only if the first attempt failed, which it does not here.

In the second example you are using a "reluctant" quantifier, which does the opposite of "greedy". It starts at the beginning and matches as little as possible, consuming one character at a time only if it has to. So the zero-length match before the "X" is found first, the matcher moves forward by one character (that's why you still see the "X" in the output), and the next match is the zero-length match after the "X".
There is a good tutorial here: http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
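To see the "only if it has to" part in action, here is a small sketch on the input "XXX". Anchoring the reluctant pattern with `$` is my own addition for illustration (it is not in the question), and the class name `GreedyVsReluctant` is made up.

```java
class GreedyVsReluctant {
    // Hypothetical helper: replace every match of regex in "XXX" with Y.
    static String run(String regex) { return "XXX".replaceAll(regex, "Y"); }

    public static void main(String[] args) {
        // Greedy: one match covering all of XXX, then the empty match at the end.
        System.out.println(run("X*"));    // YY
        // Reluctant: only empty matches, so every X is copied through.
        System.out.println(run("X*?"));   // YXYXYXY
        // Reluctant but forced: $ can only succeed at the end of the input,
        // so X*? has to expand over all three X characters first.
        System.out.println(run("X*?$"));  // YY
    }
}
```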

cs0lar