So I have this regex

 ^<(.*?)>

It is supposed to match the contents of the first opening tag. However, while this works in PHP, in Java it matches everything between the first < and the last >.

For example, when it is run on this:

<tag1 attr1="val1"><tag2></tag2></tag1>

PHP matches:

 tag1 attr1="val1"

while Java matches:

tag1 attr1="val1"><tag2></tag2></tag1
John
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – itdoesntwork Nov 09 '13 at 23:36
  • How are you using it? It works for me, check [this IDEOne working example](http://ideone.com/VruCSF). – BackSlash Nov 09 '13 at 23:38
  • OK, the mistake was on my part; that's what being a noob in Java does to you. I used if (m.matches()) instead of m.find(). By the way, can someone explain why it works that way? – John Nov 09 '13 at 23:45
  • John - if you want people to explain things, you need to state what you want explained a lot more clearly than that! – Stephen C Nov 09 '13 at 23:48
  • @John because the `matches` method checks the _entire_ string. When you use the `matches` method on a regex like `^<(.*?)>`, it's the same as writing `^<(.*?)>$`, which still matches, since your string starts with `<` and ends with `>` – BackSlash Nov 09 '13 at 23:51
  • @Stephen C The Java docs say that matches() "Attempts to match the entire region against the pattern." while find() "finds the next subsequence of the input sequence that matches the pattern". How come the first makes the non-greedy quantifier .*? work like a greedy one? – John Nov 09 '13 at 23:53
  • @BackSlash Thanks! That was a very good explanation. – John Nov 09 '13 at 23:54
  • @John You're welcome :) That's the difference between the `find` (which I used in my example) and the `matches`: the `find` checks for any part of the string to match the regex, the `matches` checks for the _whole_ string to match the regex :) – BackSlash Nov 09 '13 at 23:55
  • @John I posted it as an answer, if you want you can accept it :) – BackSlash Nov 10 '13 at 00:04

2 Answers

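import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s1 = "<tag1 attr1=\"val1\"><tag2></tag2></tag1>";
Pattern p = Pattern.compile("^<(.*?)>");
Matcher m = p.matcher(s1);
// find() looks for a matching subsequence; it need not cover the whole input
while (m.find()) {
    System.out.println(m.group(1));
}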

This is the code I tested, and it printed tag1 attr1="val1".

Then, in the comments, you said that you were using the matches method; that is the difference.

While the find method checks for any part of the string matching the regex, the matches method requires the entire string to match the given regex.

So, in your example:

while (m.find()) {
    System.out.println(m.group(1)); // prints: tag1 attr1="val1"
}

if (m.matches()) { // the whole input must match, as if the regex were ^<(.*?)>$
    System.out.println(m.group(1)); // prints: tag1 attr1="val1"><tag2></tag2></tag1
}
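
For anyone who wants to run the comparison end to end, here is a minimal self-contained sketch. The class name FindVsMatches and the helper methods are hypothetical names chosen for illustration; only Pattern, Matcher, find(), matches() and lookingAt() come from java.util.regex. It also shows lookingAt(), which anchors at the start of the input like matches() but does not require the whole input to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares three Matcher methods on the same regex.
public class FindVsMatches {
    static final Pattern TAG = Pattern.compile("^<(.*?)>");

    // find(): the match may cover only part of the input.
    static String viaFind(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // matches(): the match must cover the entire input.
    static String viaMatches(String input) {
        Matcher m = TAG.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // lookingAt(): the match must start at the beginning, but may stop early.
    static String viaLookingAt(String input) {
        Matcher m = TAG.matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "<tag1 attr1=\"val1\"><tag2></tag2></tag1>";
        System.out.println(viaFind(s));      // tag1 attr1="val1"
        System.out.println(viaMatches(s));   // tag1 attr1="val1"><tag2></tag2></tag1
        System.out.println(viaLookingAt(s)); // tag1 attr1="val1"
    }
}
```

If the goal is only the first opening tag, find() or lookingAt() is probably the better fit here, since neither forces the pattern to consume the rest of the input.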
BackSlash

What I didn't spot the first time is that you are explicitly using non-greedy repetition (*?).

But my original points still stand:

  • There is no difference between the semantics of PHP and Java regexes in this respect.

  • Using Java find versus Java matches does not change the semantics of the regex. Specifically it does not flip non-greedy to greedy, or vice versa. (As you postulated in a comment.)

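The reason that find and matches capture different text is solely down to the fact that matches must match the entire string. With find, the non-greedy .*? stops at the first >; with matches, the engine keeps extending .*? until the closing > coincides with the end of the input.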
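
A small sketch can make the second point concrete. It assumes a hypothetical helper, group1WithMatches, and appends .* to the pattern so that something other than the capture group can absorb the rest of the input. Under matches(), the non-greedy and greedy variants still capture different text, so the quantifier has not been flipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows that matches() keeps .*? non-greedy.
public class LazyUnderMatches {
    // Applies matches() (whole-input match) and returns group 1, or null on failure.
    static String group1WithMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "<tag1 attr1=\"val1\"><tag2></tag2></tag1>";

        // Non-greedy group, trailing .* absorbs the rest of the input.
        System.out.println(group1WithMatches("^<(.*?)>.*", s)); // tag1 attr1="val1"

        // Greedy group: backtracks only as far as the last >.
        System.out.println(group1WithMatches("^<(.*)>.*", s));  // tag1 attr1="val1"><tag2></tag2></tag1

        // Original regex: nothing else can absorb the rest, so .*? has to grow.
        System.out.println(group1WithMatches("^<(.*?)>", s));   // tag1 attr1="val1"><tag2></tag2></tag1
    }
}
```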

Stephen C