
I have a search string. When it contains a dollar symbol, I want to capture all the characters that follow it, up to but not including a dot or a subsequent dollar symbol. The latter would start a subsequent match. So for either of these search strings:

"/bla/$V_N.$XYZ.bla";
"/bla/$V_N.$XYZ";

I would want to return:

  • V_N
  • XYZ

If the search string contains percent symbols, I also want to return what's between the pair of % symbols.

The following regex seems to do the trick for that.

 "%([^%]*?)%";

This means:

  • Start and end with a %,
  • have a capture group, the (),
  • have a negated character class matching anything except a % symbol (the leading caret negates the class),
  • repeated, but not greedily: *?
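The breakdown above can be sketched as a small runnable example. The class name `PercentPairs`, the method name `between`, and the sample input with % pairs are assumptions made for illustration; only the regex itself comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PercentPairs {
    // Matches a %...% pair; group 1 holds the text between the two % symbols.
    private static final Pattern PAIR = Pattern.compile("%([^%]*?)%");

    // Returns the contents of every %...% pair found in the input.
    static List<String> between(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical input; the question does not show a %-delimited sample.
        System.out.println(between("/bla/%V_N%.%XYZ%.bla")); // [V_N, XYZ]
    }
}
```

Because the class already excludes %, the lazy `*?` and a greedy `*` should give the same captures here.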

Where some languages allow %1, %2 for capture groups, Java uses backslash-number syntax (\1, \2) for backreferences instead. So, this string compiles and generates output.

I suspect the dollar symbol and dot need escaping, as they are special symbols:

  • $ usually anchors the end of the string
  • . is a metacharacter that matches any character.
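A minimal sketch of the two points above, for use outside a character class. The class name `EscapeDemo` and the helper `firstMatch` are hypothetical; the input is the search string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String search = "/bla/$V_N.$XYZ.bla";
        // Unescaped $ is an end-of-input anchor, so nothing can follow it.
        System.out.println(firstMatch("$\\w+", search));   // null
        // Escaped \$ matches a literal dollar symbol.
        System.out.println(firstMatch("\\$\\w+", search)); // $V_N
        // Unescaped . matches any character, here the 'b' in "/bla".
        System.out.println(firstMatch("/.la", search));    // /bla
        // Escaped \. only matches a literal dot, and "/.la" never occurs.
        System.out.println(firstMatch("/\\.la", search));  // null
    }
}
```

The doubled backslashes are only there because the regex is written as a Java string literal.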

I have tried escaping them with double backslashes (\\):

  • both in character classes, e.g. [^\\.\\$%]
  • and using OR'd notation, %|\\$

in attempts to combine this logic, but I can't seem to get anything to play ball.

I wonder if another pair of eyes can see how to solve this conundrum!

My attempts so far:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main {
    public static void main(String[] args) {
        String search = "/bla/$V_N.$XYZ.bla";
        String pattern = "([%\\$])([^%\\.\\$]*?)\\1?";
        /* Either % or $ in first capture group ([%\\$])
         * Second capture group - anything except %, dot or dollar sign
         * non greedy group ( *?)
         * then a backreference to an optional first capture group \\1?
         * Have to use two \, since you escape \ in a Java string.
         */
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(search);
        List<String> results = new ArrayList<String>();
        while (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                results.add(m.group(i));
            }
        }
        for (String result : results) {
            System.out.println(result);
        }
    }
}


JGFMK

1 Answer


You may use

String search = "/bla/$V_N.$XYZ.bla";
String pattern = "[%$]([^%.$]*)";
Matcher matcher = Pattern.compile(pattern).matcher(search);
while (matcher.find()){
    System.out.println(matcher.group(1)); 
} // => V_N, XYZ

See the Java demo and the regex demo.

NOTE

  • You do not need an optional \1? at the end of the pattern. As it is optional, it does not restrict the match context and is redundant here (the negated character class already cannot match either $ or %).
  • [%$]([^%.$]*) matches % or $, then captures into Group 1 zero or more chars other than %, . and $. You only need the Group 1 value, hence matcher.group(1) is used.
  • In a character class, neither . nor $ is special, so they do not need escaping in [^%.$] or [%$].
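The notes above can be checked with a short sketch. The class name `ClassEscapeCheck` and the helper `captures` are hypothetical. The last call is an added observation rather than something stated in the answer: it runs the question's original pattern and prints its second group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassEscapeCheck {
    // Collects the given capture group from every match of regex in input.
    static List<String> captures(String regex, int group, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String search = "/bla/$V_N.$XYZ.bla";
        // Unescaped and escaped character classes behave identically.
        System.out.println(captures("[%$]([^%.$]*)", 1, search));     // [V_N, XYZ]
        System.out.println(captures("[%$]([^%\\.\\$]*)", 1, search)); // [V_N, XYZ]
        // The question's pattern: the lazy *? stops at zero characters because
        // the optional \1? lets the overall match succeed immediately.
        System.out.println(captures("([%\\$])([^%\\.\\$]*?)\\1?", 2, search)); // [, ]
    }
}
```

The third line of output is two empty strings, which suggests the lazy quantifier, not the escaping, is what kept the original attempt from capturing anything.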
Wiktor Stribiżew
  • Think I'd need a non greedy search. – JGFMK Nov 12 '19 at 15:32
  • @JGFMK No, you do not. The negated character class already does it. – Wiktor Stribiżew Nov 12 '19 at 15:34
  • I suspect this will fail to match pairs (assuming that's a requirement). For example, try `"/bla/$V_N%.$XYZ.bla"` as input – ernest_k Nov 12 '19 at 15:41
  • @ernest_k That does cause a bit of a hiccup in the results. You get an empty capture group. But luckily the data I have always has either pairs of % signs, or just begins with a $. A dot, a subsequent $, or the end of the line can be the end of what I need to capture if the thing starts with the $. I could always safeguard by checking the length of group(1) before adding it to my results too. – JGFMK Nov 12 '19 at 15:45
  • @JGFMK If you need to avoid empty strings in the results all you need is a `+` quantifier in the pattern: `String pattern = "[%$]([^%.$]+)";`. I only used `*` because I followed the original pattern logic where `*?` was used. – Wiktor Stribiżew Nov 12 '19 at 17:21
  • @WiktorStribiżew The problem ended up being more complex. I posted another follow up question here: https://stackoverflow.com/questions/58827094/java-regex-capture-string-starting-with-single-dollar-but-not-when-it-has-two?noredirect=1#comment103930869_58827094 Wondered if you had any insights for that one? – JGFMK Nov 12 '19 at 22:03
  • @Holger Unfortunately, the current question did not reflect the real requirements. I posted a more sophisticated solution meeting more specific requirements [here](https://stackoverflow.com/a/58833692/3832970). – Wiktor Stribiżew Nov 13 '19 at 09:49
  • I know that there’s a follow-up question, however, there’s a general issue with the statement that optional matches at the end of a pattern are redundant. Whenever you are processing more than the first match, they are relevant. – Holger Nov 13 '19 at 09:52
  • @Holger This will only make a difference for consecutive matches, and my suggestion takes into account the OP logic. *Here*, `\1?` is redundant, period, I explained why in my answer. I do not say optional patterns at the end of a pattern are always redundant. – Wiktor Stribiżew Nov 13 '19 at 09:59