
There is an API that I'm calling which I cannot change. That is, I cannot do this as two sequential regexes or anything like that. The API is written something like this (simplified, of course):

void apiMethod(final String regex) {
    final String input = 
        "bad:    thing01, thing02, thing03 \n" +
        "good:   thing04, thing05, thing06 \n" +
        "better: thing07, thing08, thing09 \n" +
        "worse:  thing10, thing11, thing12 \n";

    final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);

    final Matcher matcher = pattern.matcher(input);

    while (matcher.find()) {
        System.out.println(matcher.group(1));
    }
}

I invoke it something like this:

apiMethod("(thing[0-9]+)");

I want to see six lines printed out, one for each thing 04 through 09, inclusive. I have not been successful so far. Some things I have tried that did not work:

  • "(thing[0-9]+)" - This matches all 12 things, which is not what I want.
  • "^(?:good|better): *(thing[0-9]+)" - This matches only things 4 and 7.
  • "^(?:(?:good|better): .*)(thing[0-9]+)" - This matches only things 6 and 9.
  • "(?:(?:^good:|^better:|,) *)(thing[0-9]+)" - This matches everything except 1 and 10.

And many more, too numerous to list. I've tried various look-behinds, to no avail.

What I want is all the strings that match "thing[0-9]+" but only those from lines that begin with "good:" or "better:".

Or, stated more generally, I want multiple matches from a multiline pattern but only from lines with a certain prefix.

Matt Malone

1 Answer


You can use a \G-based pattern (in multiline mode):

(?:\G(?!^),|^(?:good|better):)\s*(thing[0-9]+)

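The \G anchor forces matches to be contiguous, since it matches only at the position where the previous match ended (or at the start of the input when there is no previous match). The first branch can therefore only continue a comma-separated run that was started by the second branch, i.e. on a line beginning with "good:" or "better:". The (?!^) lookahead rejects a \G that sits at the start of a line, which includes the start of the input.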
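As a rough check of the pattern above, here is a minimal sketch that reuses the sample text from the question. The class name ContiguousMatchDemo and the helper findAll are made up for illustration; the helper collects group(1) into a list instead of printing it, but otherwise mirrors the find() loop of apiMethod.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContiguousMatchDemo {
    // Same sample text as in the question's apiMethod.
    static final String INPUT =
        "bad:    thing01, thing02, thing03 \n" +
        "good:   thing04, thing05, thing06 \n" +
        "better: thing07, thing08, thing09 \n" +
        "worse:  thing10, thing11, thing12 \n";

    // Hypothetical helper: same loop as apiMethod, but returns the captures.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, Pattern.MULTILINE).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Branch 1 continues right after the previous match (\G) past a comma;
        // branch 2 starts a new run at an accepted line prefix.
        String regex = "(?:\\G(?!^),|^(?:good|better):)\\s*(thing[0-9]+)";
        findAll(regex, INPUT).forEach(System.out::println);
    }
}
```

Run against the sample text, this should print thing04 through thing09, one per line. Note that the backslashes are doubled only because the pattern is written as a Java string literal.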


If lines are short, you can also do that using a limited variable-length lookbehind:

(?<=^(?:good|better):.{0,1000})(thing[0-9]+)
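The lookbehind variant can be tried the same way. This is only a sketch: the class name LookbehindDemo and the helper matchesOf are hypothetical, and the bound of 1000 is assumed to be at least as long as any line. Java accepts this lookbehind because its maximum length is finite, and since "." does not cross line terminators by default, the lookbehind cannot reach back into a previous line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Bounded lookbehind: the line prefix must occur at the start of the
    // same line, at most 1000 characters before the item.
    static final String REGEX = "(?<=^(?:good|better):.{0,1000})(thing[0-9]+)";

    // Hypothetical helper: returns group(1) of every match in multiline mode.
    static List<String> matchesOf(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(REGEX, Pattern.MULTILINE).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input =
            "bad:    thing01, thing02, thing03 \n" +
            "good:   thing04, thing05, thing06 \n" +
            "better: thing07, thing08, thing09 \n" +
            "worse:  thing10, thing11, thing12 \n";
        matchesOf(input).forEach(System.out::println);
    }
}
```

Unlike the \G version, each match here is independent of the previous one, at the cost of re-scanning up to 1000 characters behind every candidate.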
Casimir et Hippolyte
  • Today I learned about the \G anchor. Thanks a lot! By the way, what does the (?!^) do? I know it's negative lookahead for the start of line anchor, but why is it needed? – Matt Malone Nov 10 '17 at 22:53
  • Nice regex, but you don't need the negative look ahead for start `(?!^)` because lines never start with a comma. ie this works: `"(?:\\G,|^(?:good|better):)\\s*(thing\\d+)"` – Bohemian Nov 10 '17 at 22:53
  • @Matt it isn't needed. See my comment. – Bohemian Nov 10 '17 at 22:53
  • @MattMalone: because `\G` also matches at the start of the string. Adding `(?!^)` avoids this case, but indeed, if you don't have lines that start with a comma, you can remove it. – Casimir et Hippolyte Nov 10 '17 at 22:55
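To make the point from these comments concrete, here is a small sketch comparing the two patterns on an input that begins with a comma. The input and the names StartAnchorDemo and capture are invented for the illustration. On the first find(), \G matches at offset 0, so without (?!^) the leading ",thing00" is accepted even though it follows no "good:" or "better:" prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartAnchorDemo {
    // Pattern from the answer, with the (?!^) guard.
    static final String GUARDED = "(?:\\G(?!^),|^(?:good|better):)\\s*(thing[0-9]+)";
    // Pattern from the comment, without the guard.
    static final String UNGUARDED = "(?:\\G,|^(?:good|better):)\\s*(thing\\d+)";

    // Hypothetical helper: returns group(1) of every match in multiline mode.
    static List<String> capture(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, Pattern.MULTILINE).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up input whose very first character is a comma.
        String input = ",thing00\ngood: thing01\n";
        System.out.println(capture(UNGUARDED, input)); // [thing00, thing01]
        System.out.println(capture(GUARDED, input));   // [thing01]
    }
}
```

For the sample text in the question, which starts with "bad:", both patterns give the same result.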