
I posted this question earlier.

But that wasn't quite the end of it. All the rules that applied there still apply.

So the strings:

  • "%ABC%" would yield ABC as a result (capture stuff between percent signs)
  • as would "$ABC." (capture stuff after $, giving up when another dollar or dot appears)
  • "$ABC$XYZ" would too, and also give XYZ as a result.

To add a bit more to this:

  • "${ABC}" should yield ABC too. (ignore curly braces if present - non capture chars perhaps?).
  • if you have two successive dollar signs, such as "$$EFG", or "$${EFG}",
    that should not appear in a regex result. (This is where either numbered or named backreferences come into play, and the reason I contemplated them as non-capture groups.) As I understand it, a group becomes a non-capture group with the syntax (?:...).

1) Can I say the % or $ is a non-capture group and reference that by number? Or do only capture groups get allocated numbers?

2) What is the order of the numbering, if you have ((A) (B) (C)). Is the outer group 1, A 2, B 3 C 4?

I have been looking at named groups. I saw the syntax mentioned here:

(?<name>capturing text) to define a named group "name"

\k<name> to backreference a named group "name"
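A minimal sketch of that named-group syntax in Java, for illustration only. The class name `NamedGroupSketch`, the method `between`, and the group names `delim` and `body` are made up for this example and are not part of the original code. It assumes the simple case where the closing delimiter must be the same character as the opening one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupSketch {
    // Returns the text between a pair of identical delimiters (% or $),
    // or null when no such pair is found. Hypothetical helper for illustration.
    public static String between(String input) {
        // (?<delim>[%$]) captures the opening delimiter under the name "delim";
        // \k<delim> then requires that same character again as the closer.
        Pattern p = Pattern.compile("(?<delim>[%$])(?<body>[^%$.]+)\\k<delim>");
        Matcher m = p.matcher(input);
        return m.find() ? m.group("body") : null;
    }

    public static void main(String[] args) {
        System.out.println(between("%ABC%")); // ABC
        System.out.println(between("%ABC$")); // null (closer differs from opener)
    }
}
```

Named groups are still capturing groups, so they also receive a number; the name is just a second way to refer to them.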

3) Not sure if a non-capture group can be named in Java? Can someone elucidate?

  • More info here on non capture groups.
  • More info here on lookbehinds
  • Similar answer to a question here, but didn't quite get me what I wanted. Not sure if there is a back-reference issue in Java.
  • Similar question here. But could not get my head around the working version to apply to this.

I have used the exact same Java I had in my original question, except for:

String search = "/bla/$V_N.$$XYZ.bla";
String pattern = "(?:(?<oc>[%$]))(?!(\\k<oc>))([^%.$]*)+";

This should only result in V_N.

I am really struggling with this one, and wondered if someone can help me work out how to solve this. Thanks.

JGFMK
  • Are there restrictions that make simplest solutions non-viable? You could just preprocess the text and remove every instance of `{}` and double `$` with one pass through the text, almost guaranteed to be faster than regex backtracking solution – Deltharis Nov 12 '19 at 21:50
  • Because I have to process the ones with the double $$ separately too. It's a possibility, but I'd rather get a better appreciation of regex with Java as an outcome of this. – JGFMK Nov 12 '19 at 21:53
  • You may write an expanded regex with multiple capturing groups and only grab those that are not null - `%([^%.]+)%|(?<!\$)\$(?:\{([^{}]+)\}|([^$.]+))`, see https://regex101.com/r/7Q6EAD/1 – Wiktor Stribiżew Nov 12 '19 at 22:10
  • A simple list of separate conditions would go a long way toward getting an answer. As it is now, it's hard to tell what those conditions are. Literals work best, maybe in two columns: Pass/Fail. –  Nov 12 '19 at 22:13
  • How about something loose like `(?<=(?<!\$)[%$]\{?)[^%$.\s}{]+` – bobble bubble Nov 12 '19 at 22:43
  • I have been thinking about this overnight and wondered whether, rather than using two sets of regexes for the $$ and $ versions within the same string, I could check whether the character before the found match (defensively checking for position 0) is a $. That would simplify the regex slightly and let me break out two sets of results based on that check. – JGFMK Nov 13 '19 at 06:56
  • @WiktorStribiżew - If you want to submit that as an answer, I'll accept that. Can see by using Regex101 that non capture groups don't get allocated a number for back referencing. – JGFMK Nov 13 '19 at 08:19
  • The difference between a capturing group and a noncapturing group is that the former’s match can be referenced later-on whereas the latter can not. So it makes no sense to ask for a way to reference a noncapturing group. If you want to reference a group, use a capturing group. That’s what it is for. – Holger Nov 13 '19 at 10:03
  • @Holger - it was the negative lookbehind that WiktorStribiżew gave as a solution that was the correct way to go when dealing with the double $ symbol conundrum. I became confused, thinking that back-referencing it with \\1 or a named group, in conjunction with (?!\1), would have been a way to solve it (per the answer here: https://stackoverflow.com/a/16717823/495157). But since I didn't want to capture the $ symbol, that thinking muddied the waters. I couldn't think of an elegant way to deal with complex groups within groups and knowing which ones to get. – JGFMK Nov 13 '19 at 10:19
  • I see; you asked for the numbering of nested groups. The answer is simple. The order of the opening brackets of the capturing groups matters. – Holger Nov 13 '19 at 10:36
  • @Holger - My assumption was correct. And I confirmed that with a comment below the answer Wiktor gave. – JGFMK Nov 13 '19 at 10:37
  • Well, yes. But for the next time, don’t make assumptions, just go directly to [the documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#cg)… – Holger Nov 13 '19 at 10:43
  • Cheers for link - have bookmarked that section for future reference ;-) – JGFMK Nov 13 '19 at 10:45

1 Answer

You may write a slightly more verbose regex with multiple capturing groups and only grab those that are not null, or simply concatenate the found group values, since only one of them will ever be set for each match:

%([^%.]+)%|(?<!\$)\$(?:\{([^{}]+)\}|([^$.]+))

See the regex demo.

Details

  • %([^%.]+)% - %, Group 1: one or more chars other than % and ., then a % is consumed
  • | - or
  • (?<!\$) - a negative lookbehind that matches a location in string that is not immediately preceded with $
  • \$ - a $
  • (?: - start of the non-capturing container group matching either of:
    • \{([^{}]+)\} - {, Group 2: any one or more chars other than { and }, then } is consumed
    • | - or
    • ([^$.]+) - Group 3: 1 or more chars other than $ and .
  • ) - end of the non-capturing container group.
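The group numbers in this breakdown follow the rule from the Java `Pattern` documentation: capturing groups are numbered by counting their opening parentheses from left to right, and `(?:...)` groups get no number. A small sketch of that rule, assuming a hypothetical helper `groups` and using the nested pattern from the question without the spaces:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingSketch {
    // Matches the whole input against regex and returns every group's text;
    // index 0 is the entire match. Hypothetical helper for illustration.
    public static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match");
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // The outer group is 1, then A, B, C are 2, 3, 4.
        System.out.println(String.join(",", groups("((A)(B)(C))", "ABC")));   // ABC,ABC,A,B,C
        // The (?:...) wrapper is not numbered, so A, B, C become 1, 2, 3.
        System.out.println(String.join(",", groups("(?:(A)(B)(C))", "ABC"))); // ABC,A,B,C
    }
}
```

This is why the regex above has exactly three numbered groups even though it contains four pairs of grouping parentheses.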

Java usage:

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "%([^%.]+)%|(?<!\\$)\\$(?:\\{([^\\{}]+)\\}|([^$.\\s]+))";
String string = "%ABC%\n$ABC.\n$ABC$XYZ  ${ABC}\n\n$$EFG $${EFG}.";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher m = pattern.matcher(string);
List<String> results = new ArrayList<>();
while (m.find()) {
    // Only one of the three groups is non-null per match, so concatenating is safe.
    results.add(Objects.toString(m.group(1),"") + 
        Objects.toString(m.group(2),"") + 
        Objects.toString(m.group(3),""));
}
System.out.println(results); // => [ABC, ABC, ABC, XYZ, ABC]

Mind that in regular Java string literals, \ should be escaped (i.e. \\) to introduce a single literal backslash that is used as part of regex escapes.
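As a further check, here is a sketch that applies the same regex (the variant shown at the top of this answer, without `\s`) to the search string from the question. The class name `PlaceholderSketch` and the method `extract` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderSketch {
    // Same alternatives as described above: %NAME%, ${NAME}, or $NAME,
    // where a $ directly preceded by another $ is rejected by the lookbehind.
    private static final Pattern PLACEHOLDER =
        Pattern.compile("%([^%.]+)%|(?<!\\$)\\$(?:\\{([^{}]+)\\}|([^$.]+))");

    // Collects the captured name from each match; exactly one group is set per match.
    public static List<String> extract(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(input);
        while (m.find()) {
            names.add(Objects.toString(m.group(1), "")
                + Objects.toString(m.group(2), "")
                + Objects.toString(m.group(3), ""));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(extract("/bla/$V_N.$$XYZ.bla")); // [V_N]
    }
}
```

The first `$` of `$$XYZ` fails because the next character is another `$`, and the second `$` fails the lookbehind, so only `V_N` is returned.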

Wiktor Stribiżew
  • To anyone else wanting answers to the other questions I posed. Non-capture groups can't be back-referenced by either number Q1 or name Q3. And for Q2, regex101.com provides a nice colour coded answer and you can hover over the regex expression to determine that. My assumption was correct. – JGFMK Nov 13 '19 at 10:29
  • Just wish Java would follow C# and preface double quotes with @ character, so things like \ don't need escaping! ;-) https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/strings/index (Verbatim string literals) – JGFMK Nov 13 '19 at 10:48
  • Really like the Objects.toString() solution. Had forgotten too that you always have a capture group 0 too. Just been reading up about that in *Ken Kousen's* **Modern Java Recipes** too. Thx to @Holger for the link https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#cg – JGFMK Nov 13 '19 at 11:23
  • @JGFMK Right, Java does not support raw string literals. Scala, Kotlin, Groovy do have something similar (triple quoted string literals, slashy strings), but Java lags behind here. – Wiktor Stribiżew Nov 13 '19 at 11:26