I'm having some trouble with Java regular expressions.
I have a program that reads through an HTML file and replaces any string enclosed in @VR@ markers, e.g. @VR@Test1 2 3 4@VR@

However, if a line contains more than one string surrounded by @VR@, my pattern does not match them separately. It matches from the leftmost @VR@ to the rightmost @VR@ on the line and takes everything in between.

For example:

<a href="@VR@URL-GOES-HERE@VR@" target="_blank" style="color:#f4f3f1; text-decoration:none;" title="ContactUs">@VR@Google@VR@</a>    

My code would match

URL-GOES-HERE@VR@" target="_blank" style="color:#f4f3f1; text-decoration:none;" title="ContactUs">@VR@Google

Here is my Java code. Would appreciate if you could help me to solve this:

Pattern p = Pattern.compile("@VR@.*@VR@");
Matcher m;
Scanner scanner = new Scanner(htmlContent);

while (scanner.hasNextLine()) {
      String line = scanner.nextLine();
      m = p.matcher(line);

      StringBuffer sb = new StringBuffer();

      while (m.find()) {
           String match_found = m.group().replaceAll("@VR@", "");
           System.out.println("group: " + match_found);
      }
}

I tried replacing m.group() with m.group(0) and m.group(1), but neither helped. Also, m.groupCount() always returns zero, even if there are two matches as in my example above.

Thanks, your help will be very much appreciated.

Alek.k

2 Answers

Your problem is that .* is "greedy": it matches as long a substring as possible while still letting the overall expression match. So, for example, in @VR@ 1 @VR@ 2 @VR@ 3 @VR@, the .* will match everything between the first and the last @VR@ (that is, " 1 @VR@ 2 @VR@ 3 ") rather than stopping at the second @VR@.

The simplest fix is to make it "non-greedy" (matching as little as possible while still letting the expression match), by changing the * to *?:

Pattern p = Pattern.compile("@VR@.*?@VR@");
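
To make the difference concrete, here is a minimal sketch that runs both patterns over a shortened version of the line from the question. The class name ReluctantDemo and the helper findAll are made up for this illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Hypothetical helper: returns every match of the regex in the line,
    // with the @VR@ delimiters stripped from each match.
    static List<String> findAll(String regex, String line) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            found.add(m.group().replace("@VR@", ""));
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<a href=\"@VR@URL-GOES-HERE@VR@\" title=\"ContactUs\">@VR@Google@VR@</a>";

        // Greedy: a single match spanning from the first @VR@ to the last one.
        System.out.println(findAll("@VR@.*@VR@", line).size());

        // Reluctant: one match per @VR@...@VR@ pair.
        System.out.println(findAll("@VR@.*?@VR@", line));
    }
}
```

With the reluctant quantifier, each call to m.find() stops at the nearest closing @VR@, so the loop from the question sees the two values one at a time.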

"Also m.groupCount() always returns zero, even if there are two matches as in my example above."

That's because m.groupCount() returns the number of capture groups in the underlying pattern (parenthesized subexpressions, whose matched substrings are retrieved using m.group(1), m.group(2), and so on), not the number of matches. Your pattern has no capture groups, so m.groupCount() returns 0.
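
As a rough sketch of that distinction, the following adds one capture group to the pattern and counts matches separately by looping over find(). The class name GroupCountDemo and the helpers countMatches and firstValue are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // One capture group around the text between the @VR@ delimiters.
    static final Pattern P = Pattern.compile("@VR@(.*?)@VR@");

    // Hypothetical helper: counts how many times the pattern matches in the line.
    static int countMatches(String line) {
        Matcher m = P.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Hypothetical helper: captured text of the first match, or null if there is none.
    static String firstValue(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "@VR@URL-GOES-HERE@VR@ and @VR@Google@VR@";
        // groupCount() depends only on the pattern, not on the input.
        System.out.println(P.matcher(line).groupCount());
        System.out.println(countMatches(line));
        System.out.println(firstValue(line));
    }
}
```

With the capture group in place, m.group(1) gives the value without the delimiters, so the replaceAll("@VR@", "") call from the question is no longer needed.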

ruakh

You can try the regular expression:

@VR@(((?!@VR@).)+)@VR@

Demo:

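import java.util.regex.Pattern;

public class Demo {
    // Captures the text between two @VR@ markers; the negative lookahead
    // keeps the capture from running past the next @VR@.
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("@VR@(((?!@VR@).)+)@VR@");

    public static void main(String[] args) {
        String input = "<a href=\"@VR@URL-GOES-HERE@VR@\" target=\"_blank\" style=\"color:#f4f3f1; text-decoration:none;\" title=\"ContactUs\">@VR@Google@VR@</a> ";

        // Replace each @VR@...@VR@ span with its captured contents (group 1).
        System.out.println(
            REGEX_PATTERN.matcher(input).replaceAll("$1")
        );  // prints "<a href="URL-GOES-HERE" target="_blank" style="color:#f4f3f1; text-decoration:none;" title="ContactUs">Google</a> "
    }
}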
Paul Vargas