The JavaDoc for java.util.regex.Matcher.find() says:
Attempts to find the next subsequence of the input sequence that matches the pattern.
This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.
If the match succeeds then more information can be obtained via the start, end, and group methods.
This is not what it actually does. After playing with it for a bit, I have some intuitions about its real behavior, but I wonder whether that behavior is documented anywhere.
Some examples:
Pattern.compile("a|ad").matcher("ad").find() --> group() = "a"
Pattern.compile("ad|a").matcher("ad").find() --> group() = "ad"
Clearly, the subsequence a matches both patterns, but the second matcher skips over a and finds ad as the "next subsequence that matches the pattern".
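For anyone who wants to reproduce the two results above, here is a minimal sketch. The class name AlternationOrderDemo and the helper firstMatch are made up for illustration; only Pattern.compile, matcher, find, and group come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOrderDemo {
    // Hypothetical helper: returns the text of the first match reported by
    // find(), or null if find() reports no match at all.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Same input, same set of alternatives, different order.
        System.out.println(firstMatch("a|ad", "ad")); // a
        System.out.println(firstMatch("ad|a", "ad")); // ad
    }
}
```

The two patterns accept exactly the same set of strings, so the different results can only come from the order in which the alternatives are written.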
Similarly, I think we can all agree that [abc]+ matches a single a, b, or c, but
Pattern.compile("[abc]+").matcher("ababab").find() --> group() = "ababab"
which skips over the fact that a is a perfectly fine match for the pattern.
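To make the contrast concrete, the sketch below compares [abc]+ with its reluctant form [abc]+? (the reluctant +? quantifier is listed in the Pattern documentation). The class QuantifierDemo and the helper allMatches are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Hypothetical helper: collects the text of every match that repeated
    // calls to find() report, in the order they are reported.
    static List<String> allMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Greedy quantifier: one match that consumes the whole input.
        System.out.println(allMatches("[abc]+", "ababab"));  // [ababab]
        // Reluctant quantifier: six one-character matches.
        System.out.println(allMatches("[abc]+?", "ababab")); // [a, b, a, b, a, b]
    }
}
```

Both patterns accept the same strings, yet find() reports different subsequences, so the result again depends on how the pattern is written rather than only on what it matches.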
I think what's happening is that, given that the implementation is pattern-based, it's trying pieces of the pattern in some order. Thus a|ad matches a and ignores d, but ad|a does the opposite. [abc]+ greedily matches, even when it's looking for the next subsequence match.
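One way to probe that hypothesis is to check whether the start position takes priority over the order of the alternatives. The sketch below is only an experiment, not a statement of the specification; LeftmostFirstDemo and describe are hypothetical names, and the interpretation (earliest start first, then alternatives in written order) is my assumption about what the output suggests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeftmostFirstDemo {
    // Hypothetical helper: formats the first match as "start..end=text",
    // or returns "no match" if find() fails.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.start() + ".." + m.end() + "=" + m.group();
    }

    public static void main(String[] args) {
        // The first alternative (d) cannot match at index 0, so the second
        // alternative wins there, even though d would match at index 1.
        System.out.println(describe("d|ad", "ad")); // 0..2=ad
        System.out.println(describe("d|a", "ad"));  // 0..1=a
        // When both alternatives can match at index 0, the one written
        // first is the one reported.
        System.out.println(describe("a|ad", "ad")); // 0..1=a
    }
}
```

These results are consistent with the idea that the order of alternatives only breaks ties among candidates that begin at the same, earliest possible position.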
So the question is, what should the JavaDoc say? It's not the longest subsequence that matches (see a vs ad), and it's not the first subsequence that matches (see ababab vs a). So what is this method actually doing, and is there a way to pin it down to a reasonable specification?
Note that I understand what is going on here. I'm simply pointing out that the behavior of this method doesn't match the JavaDoc and that it's not clear how you could fix the JavaDoc without explicitly describing the implementation of the method. find does not find "the next subsequence that matches the pattern". It finds a next subsequence that matches the pattern based not just on what strings match the pattern, but also how the pattern is constructed.