1

In a Java program, I want to find all occurrences in a given String of these substrings: $$ or $\d (the symbol '$' followed by an integer).

My problem started when I added a constraint: a match counts only if the matched string is not part of a substring delimited by a certain sequence of characters.

For example, I want to ignore the matches if they are part of a substring surrounded by "/{" and "/}".

The following example finds all occurrences of $$ or $\d, but does not consider the additional constraint of ignoring a match that lies between "/{" and "/}".

public static final String PARAMETERS_PREFIX = "$";
public static final String ALL_PARAMS_SUFFIX = "$";
public static final String BEGIN_JAVA_EXPRESSION = "/{";
public static final String END_JAVA_EXPRESSION = "/}";
...
String test = "$1 xxx $$ " //$1 and $$ are matches
  + BEGIN_JAVA_EXPRESSION + "xxx $2 xxx" + END_JAVA_EXPRESSION; //$2 SHOULD NOT be a match
Set<String> symbolsSet = new LinkedHashSet<String>();
Pattern pattern = Pattern.compile(Pattern.quote(PARAMETERS_PREFIX)+"(\\d+|"+Pattern.quote(ALL_PARAMS_SUFFIX)+")");
Matcher findingMatcher = pattern.matcher(test);
while(findingMatcher.find()) {
  String match = findingMatcher.group();
  symbolsSet.add(match);
}
return new ArrayList<String>(symbolsSet);

In addition to finding the keywords that are not part of a certain substring, I want to be able to replace only those keywords with certain values afterwards. So the option of just removing everything between the delimiters before doing the match probably will not help: afterwards I need the original string with the matched tokens replaced by certain values, and the tokens inside the delimited region must be left unmodified. This should be easy once I find the right regex.

Could someone give me a hint about how to write the right regex for this problem?

Sergio

5 Answers

2

Is it permissible to use more than one regex? It might be less sexy, but you could do this with three regexes pretty easily (these are not the actual regexes):

  1. For getting the string you are looking for ($$ | ${num})
  2. For '/{'
  3. For '/}'

It should be fairly easy to match up the invalid areas using 2 and 3. You can then use those spans to eliminate results from 1.
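A minimal sketch of the idea above, assuming the delimiters are never nested and that an unclosed "/{" excludes nothing. The class and method names (SpanFilter, excludedSpans, tokensOutsideSpans) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanFilter {
    // Regex 1: the tokens ($$ or $ followed by digits)
    private static final Pattern TOKEN = Pattern.compile("\\$(\\d+|\\$)");
    // Regex 2 and 3: the literal delimiters
    private static final Pattern BEGIN = Pattern.compile(Pattern.quote("/{"));
    private static final Pattern END = Pattern.compile(Pattern.quote("/}"));

    /** Returns [start, end) index pairs for every closed /{ ... /} region. */
    static List<int[]> excludedSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher begin = BEGIN.matcher(text);
        Matcher end = END.matcher(text);
        int from = 0;
        // Pair each "/{" with the first "/}" that follows it
        while (begin.find(from) && end.find(begin.end())) {
            spans.add(new int[] {begin.start(), end.end()});
            from = end.end();
        }
        return spans;
    }

    /** Returns the tokens that do not lie inside any excluded span. */
    static List<String> tokensOutsideSpans(String text) {
        List<int[]> spans = excludedSpans(text);
        List<String> result = new ArrayList<>();
        Matcher token = TOKEN.matcher(text);
        while (token.find()) {
            boolean inside = false;
            for (int[] span : spans) {
                if (token.start() >= span[0] && token.end() <= span[1]) {
                    inside = true;
                    break;
                }
            }
            if (!inside) {
                result.add(token.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [$1, $$]
        System.out.println(tokensOutsideSpans("$1 xxx $$ /{xxx $2 xxx/}"));
    }
}
```

Because the spans are plain index pairs into the original string, the same list can later drive a manual replacement with substring and append.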

ControlAltDel
  • +1 Thanks for your answer @ControlAltDel. However, I forgot to say in the question that afterwards I need to replace the matched strings with other values (I have just updated it). I think having only one regex expression would be the ideal. Not sure how to do the replacing if for finding the keywords I used more than one regex. – Sergio May 02 '12 at 19:31
  • I don't see how finding the locations using the 3 regexs I suggested is any more complicated than with 1 regex. But you know your problem better than I do... – ControlAltDel May 02 '12 at 19:34
  • What I want to say is that I am looking for one regex that I could use both for finding the matches, and for replacing afterwards the tokens for certain values. So I could do something like test.replaceAll("killerRegex", "newValue"). Where "killerRegex" will ignore matches in the surrounded areas. – Sergio May 02 '12 at 19:43
  • I'm not sure it's possible to do with just 1 regex - I definitely don't know how to do it. But at this point you're probably wasting time waiting for an answer here - just use the indexes you find the way I recommended, then use String.substring and append in your replacement string manually. It's a little tougher than replaceAll but not at all burdensome – ControlAltDel May 02 '12 at 19:53
  • @ControlAltDel I think you are right, take a look at my answer. I work with the index of `/{`. I hope it will be enough for the OP. – alain.janinm May 02 '12 at 20:12
1

I recommend using multiple regular expressions for this. Trying to do it all at once – though enticing – seems to be pretty messy.

  1. Remove your "Java Expressions" from the String: /\{.*?/\}
  2. Run your matcher on the resulting String: \$(?:\d+|\$)

Note: I was lazy on the first expression, so it assumes that every occurrence of /{ is eventually followed by /} and that there is no nesting.
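The two steps can be sketched roughly like this. The class and method names are hypothetical, the braces are escaped because Java's regex engine treats a bare { as the start of a repetition, and the token pattern is written so that $$ counts as one token, as in the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripThenMatch {
    static List<String> findTokens(String text) {
        // Step 1: drop every /{ ... /} region (reluctant match, no nesting assumed)
        String stripped = text.replaceAll("/\\{.*?/\\}", "");

        // Step 2: look for $$ or $<digits> in what is left
        Matcher m = Pattern.compile("\\$(?:\\d+|\\$)").matcher(stripped);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [$1, $$, $4]
        System.out.println(findTokens("$1 xxx $$ /{xxx $2 xxx/} $4"));
    }
}
```

Note that the match positions refer to the stripped string, so this finds the tokens but does not by itself help with replacing them in the original string.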

Brendan
1

The first part that you need can be achieved using this regex:

(?<!/{)\($[$|\d])(?!}/)

So, after running this you'll get all your matches in groups - from now on you can get Java to do the hard work by evaluating the match in the group and finding an appropriate replacement.

You should be able to use a backreference somehow for the replacement bit, but I guess you can figure that out.

UPDATE:

(?<!/{) - it's a negative lookbehind - it says: from the current position assert that the previous characters are not /{. If this evaluates to true the match for /{ is discarded and the real matching begins. Lookahead/lookbehind are zero-width assertions which don't participate in the match.

(?!}/) - similarly but in the other direction - from the current position assert that the following characters are not }/. These also don't participate in the match. So effectively if these conditions are met, your match will still be just the text within the assertions, i.e. $$ or $\d.

Btw: it's possible that you'd need to escape some characters - the ones I remember are { and $ outside character class

(?<!/\{)\(\$[$|\d])(?!}/)

see also: How to escape text for regular expression in Java
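For the replacement step, here is one possible sketch. It is not the lookaround pattern from this answer: it swaps in an alternation that first consumes a whole /{ ... /} region and otherwise captures a token in group 1, then replaces only when group 1 took part in the match. TokenReplacer and valueFor are illustrative names, and non-nested delimiters are assumed:

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenReplacer {
    // Either a whole /{ ... /} region (nothing captured) or a token captured in group 1
    private static final Pattern REGION_OR_TOKEN =
            Pattern.compile("/\\{.*?/\\}|(\\$(?:\\d+|\\$))");

    static String replaceTokens(String text, Function<String, String> valueFor) {
        Matcher m = REGION_OR_TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Group 1 is null when the match was a delimited region: copy it back unchanged
            String replacement = (m.group(1) != null) ? valueFor.apply(m.group(1)) : m.group();
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String test = "$1 xxx $$ /{xxx $2 xxx/}";
        // prints <$1> xxx <$$> /{xxx $2 xxx/}
        System.out.println(replaceTokens(test, token -> "<" + token + ">"));
    }
}
```

The same pattern also works for the finding step: iterate with find() and keep only the matches where group(1) is not null.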

Joanna Derks
  • thanks @Joanna. The "\" between the first two parentheses is a mistake ? I do not see what are you escaping with it. – Sergio May 02 '12 at 21:30
  • @Sergio - in the first brackets I used `\` to escape `{`, which otherwise might have been interpreted as part of the structure you can use to specify the number of occurrences of a character: `{2, 3}` means at least twice and at most three times – Joanna Derks May 03 '12 at 12:03
  • Hi @Joanna, I was talking about the "\" between "(?<!/\{)" and "(\$[$|\d])", I did not get what you mean with that. – Sergio May 03 '12 at 12:16
  • @Sergio - ah, this one - then I agree, I don't know where it comes from either, must have been added on error as I definitely don't want to match an opening bracket ;) – Joanna Derks May 03 '12 at 12:58
0

I'm not sure you can do that with one regex. In case no one can provide this ultimate regex, I made a little workaround:

public static final String PARAMETERS_PREFIX = "$";
public static final String ALL_PARAMS_SUFFIX = "$";
public static final String BEGIN_JAVA_EXPRESSION = "/{";
public static final String END_JAVA_EXPRESSION = "/}";

String test = "$1 xxx $$ " // $1 and $$ are matches
    + BEGIN_JAVA_EXPRESSION + "xxx $2 xxx" + END_JAVA_EXPRESSION; // $2 SHOULD NOT be a match
Set<String> symbolsSet = new LinkedHashSet<String>();
Pattern pattern = Pattern.compile(Pattern.quote(PARAMETERS_PREFIX) + "(\\d+|" + Pattern.quote(ALL_PARAMS_SUFFIX) + ")");
Matcher findingMatcher = pattern.matcher(test);
while (findingMatcher.find()) {
    String match = findingMatcher.group(0);
    int idx = findingMatcher.start();
    // Nearest "/{" at or before the match, if any
    int bexIdx = test.lastIndexOf(BEGIN_JAVA_EXPRESSION, idx);
    if (bexIdx != -1) {
        // First "/}" after that "/{"
        int endIdx = test.indexOf(END_JAVA_EXPRESSION, bexIdx);
        // Closed before the match (or never closed, endIdx == -1): keep the match
        if (endIdx < idx) {
            symbolsSet.add(match);
        }
    } else {
        // No "/{" precedes the match at all
        symbolsSet.add(match);
    }
}
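The same index check can be reused for the replacement the question asks about. The following is only a sketch under the same assumptions as the workaround (no nesting, an unclosed "/{" excludes nothing); IndexBasedReplacer, insideExpression and replace are made-up names, and the output is assembled manually instead of with replaceAll:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexBasedReplacer {
    static final String BEGIN = "/{";
    static final String END = "/}";
    static final Pattern TOKEN = Pattern.compile("\\$(\\d+|\\$)");

    /** True if idx lies between a "/{" and the first "/}" that follows it. */
    static boolean insideExpression(String text, int idx) {
        int begin = text.lastIndexOf(BEGIN, idx);
        if (begin == -1) {
            return false;
        }
        int end = text.indexOf(END, begin);
        // end == -1 (never closed) counts as outside, as in the workaround
        return end > idx;
    }

    /** Replaces every token outside the delimited regions with newValue. */
    static String replace(String text, String newValue) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        int last = 0;
        while (m.find()) {
            if (insideExpression(text, m.start())) {
                continue; // leave tokens inside /{ ... /} untouched
            }
            // Copy the text since the previous replacement, then the new value
            out.append(text, last, m.start()).append(newValue);
            last = m.end();
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        // prints V xxx V /{xxx $2 xxx/} V
        System.out.println(replace("$1 xxx $$ /{xxx $2 xxx/} $4", "V"));
    }
}
```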
alain.janinm
0

You can use a Pattern with Lookaround:

(?<!\\{[^\\}]{0,100})\\$(\\d|\\$)(?![^\\{]*\\})

  • (?<!\\{[^\\}]{0,100}): group used to restrict what may precede the match.

    This uses a negative lookbehind, (?<!X), where X is a regex that must not precede the main expression. In Java, you can't use a negative lookbehind without an obvious maximum length, so you can't use \\{.*. You could use Integer.MAX_VALUE or testString.length() as the bound. Another thing: you must check whether an end symbol appears between the start symbol and the match. Therefore the expression uses [^\\}] instead of . (any character).

  • \\$(\\d|\\$): the main group being sought.

    Nothing unusual here.

  • (?![^\\{]*\\}): group used to restrict what may follow the match.

    This uses a negative lookahead, (?!X), where X is a regex that must not follow the main expression. Here you can use an unbounded length. Again, you must check whether a start symbol of a new substring appears first. That is why you use [^\\{]* instead of .*.

However, adding more constraints will add more complexity to your regex.


String to test the pattern: "$1 xx3x $$ /{xxx $2 xxx/} $4"
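A small sketch for trying the pattern on that string. LookaroundDemo and find are illustrative names; the pattern is the one given above, which matches a single digit after the $:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // The lookaround pattern from above, written as a Java string literal
    private static final Pattern PATTERN =
            Pattern.compile("(?<!\\{[^\\}]{0,100})\\$(\\d|\\$)(?![^\\{]*\\})");

    static List<String> find(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [$1, $$, $4]
        System.out.println(find("$1 xx3x $$ /{xxx $2 xxx/} $4"));
    }
}
```

With the {0,100} bound, a token more than 100 characters after its "/{" would no longer be excluded by the lookbehind alone, so the bound should be chosen to cover the longest expected region.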