-3

The JavaDoc for java.util.regex.Matcher.find() says:

Attempts to find the next subsequence of the input sequence that matches the pattern.

This method starts at the beginning of this matcher's region, or, if a previous invocation of the method was successful and the matcher has not since been reset, at the first character not matched by the previous match.

If the match succeeds then more information can be obtained via the start, end, and group methods.

This is not what it actually does. And after playing with it for a bit, I have some intuitions about what it actually does, but I wonder if the behavior is actually documented anywhere.

Some examples:

Pattern.compile("a|ad").matcher("ad").find() --> group() = "a"
Pattern.compile("ad|a").matcher("ad").find() --> group() = "ad"

Clearly, the subsequence a matches both patterns, but the second matcher skips over a and finds ad as the "next subsequence that matches the pattern".

Similarly, I think we can all agree that [abc]+ matches a single a, b, or c, but

Pattern.compile("[abc]+").matcher("ababab").find() --> group = "ababab"

which skips over the fact that a is a perfectly fine match for the pattern.

I think what's happening is that, given that the implementation is pattern-based, it's trying pieces of the pattern in some order. Thus a|ad matches a and ignores d, but ad|a does the opposite. [abc]+ greedily matches, even when it's looking for the next subsequence match.

So the question is, what should the JavaDoc say? It's not the longest subsequence that matches (see a vs ad), and it's not the first subsequence that matches (see ababab vs a). So what is this method actually doing, and is there a way to pin it down to a reasonable specification?

Note that I understand what is going on here. I'm simply pointing out that the behavior of this method doesn't match the JavaDoc and that it's not clear how you could fix the JavaDoc without explicitly describing the implementation of the method. find does not find "the next subsequence that matches the pattern". It finds a next subsequence that matches the pattern based not just on what strings match the pattern, but also how the pattern is constructed.

Todd O'Bryan
  • 2,234
  • 17
  • 30
  • 1
    It's testing each part of the expression in order. Quantifiers are greedy by default and try to match the longest subsequence possible. If there's no matching sequence, it backtracks and tries the next variant. I'm not sure where the ambiguity is, especially in that piece of Javadoc. – shmosel Jul 03 '23 at 23:33
  • 1
    Looks exactly like I'd expect. For alternation, it's matching the first alternative, so does not try the second. What would you have it do? How should it determine which alternative is "best"? – Arfur Narf Jul 03 '23 at 23:50
  • The reason this came up is that I'm working on an FSA-based implementation. My results don't match the results Java gives, but do match what the documentation says. Say `Pattern x;` `x.matcher("abc");` `x.find()` is true and `x.group()` is `"abc"`. I think from the JavaDoc it's safe to assume that the subsequences `"a"`, `"b"`, and `"ab"` do NOT match the pattern, since the doc says the find method returns the "next subsequence that matches the pattern." However, as we've seen, that's not true. – Todd O'Bryan Jul 04 '23 at 21:11
  • I see. But I think you're reading too much into "next subsequence." To me it just means the subsequence starting at the nearest index, without specifying whether longer or shorter sequences are prioritized. – shmosel Jul 04 '23 at 21:40

2 Answers2

1

Java's regular expressions use a backtracking implementation, so given a pattern like x|y, it will first attempt to match x, and if that fails, reset and try y. Thus, the order matters.

For the first example, both a and ad will at least partially match the string "ad", so whichever pattern is given first to the or operator is found as the match.

For the second example, + is a greedy quantifier, so it will attempt to match as much as possible. In this case, the whole string matches. To match as little as possible, the reluctant +? should be used, which would only match a single "a".

The documentation for this specific Matcher method does not explain all these details about regular expressions, but it is not incorrect.

Unmitigated
  • 76,500
  • 11
  • 62
  • 80
  • It is incorrect. It does not find the next subsequence – Todd O'Bryan Jul 04 '23 at 20:54
  • I will concede that it finds _a_ next subsequence. – Todd O'Bryan Jul 04 '23 at 21:18
  • @ToddO'Bryan The documentation states nothing about how the subsequences are prioritized, so I don't see a meaningful distinction between "the" and "a" in this context. Clearly, the order doesn't match your expectations, but that's not the same as being wrong. – Unmitigated Jul 04 '23 at 22:55
  • Right. And without some information about that, how do you decide what the right implementation would be? (Well, other than matching the reference implementation...) – Todd O'Bryan Jul 05 '23 at 17:38
0

"... Clearly, the subsequence a matches both patterns, but the second matcher skips over a and finds ad as the "next subsequence that matches the pattern". ..."

It's not skipping over a, as the pattern has specified ad first.

"... It's not the longest subsequence that matches (see a vs ad) ..."

It would be the longest subsequence of the input, not the pattern.

The regex engine is going to try and match the first supplied pattern, not the shortest of the two.

"... Similarly, I think we can all agree that [abc]+ matches a single a, b, or c, but ... which skips over the fact that a is a perfectly fine match for the pattern. ..."

The + is specifying that it should match as many as possible.
You could append a ?, [abc]+?.  This will cause it to match just the a, and then the b, and so on and so forth.

"... It finds a next subsequence that matches the pattern based not just on what strings match the pattern, but also how the pattern is constructed."

Exactly; I guess they are inferring this.  I find that most technical writing has some sort of unintentional inferred context.
You could, of course, just inquire about this to the company.  I'm sure they'd appreciate the notice.

Essentially, a regex engine is very simplistic, it just works from left to right, within a loop.

I suggest reading the Wikipedia article on regex, it covers all of these topics.
Wikipedia – Regular expression.

Reilas
  • 3,297
  • 2
  • 4
  • 17
  • Actually, this is an assumption about regex engines. With a FSA-based interpretation, you can lose information about the actual pattern, since the FSA just records what is accepted or not. – Todd O'Bryan Jul 05 '23 at 17:37
  • @ToddO'Bryan, what information? – Reilas Jul 05 '23 at 22:20
  • The order of the alternates, for example. --> (1) --a--> ((2)) --d--> ((3)) is a reasonable DFA representation of the regex `a|ad` (where 1 is the start state and 2 and 3 are both accepting states). This DFA matches exactly the strings matched by `a|ad`, but you can't tell which order they came in the original regex. – Todd O'Bryan Jul 06 '23 at 01:56
  • @ToddO'Bryan, I don't see how that correlates to a regex engine working in a loop from left to right. You can view the _Java_ code if you want, _[here](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/regex/Matcher.java)_. – Reilas Jul 06 '23 at 02:54
  • Right. I'm implementing it in a different way. And the only way to tell what the method does it to run it or look at the code. But we shouldn't care how it's implemented, just what it does. That way you can switch implementations without consumers needing to know. My point is that the behavior is very difficult to describe without referring to the implementation. That's usually a code smell. – Todd O'Bryan Jul 06 '23 at 17:58
  • @ToddO'Bryan, that's great. If the documentation infers a context, send them an email. – Reilas Jul 06 '23 at 20:26