215

I am trying to understand this code block. In the first one, what is it we are looking for in the expression?

My understanding is that it is any character (0 or more times *) followed by any number between 0 and 9 (one or more times +) followed by any character (0 or more times *).

When this is executed the result is:

Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0

Could someone please go through this with me?

What is the advantage of using Capturing groups?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTut3 {

    public static void main(String args[]) {
        String line = "This order was placed for QT3000! OK?"; 
        String pattern = "(.*)(\\d+)(.*)";

        // Create a Pattern object
        Pattern r = Pattern.compile(pattern);

        // Now create matcher object.
        Matcher m = r.matcher(line);

        if (m.find()) {
            System.out.println("Found value: " + m.group(0));
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
        } else {
            System.out.println("NO MATCH");
        }
    }

}
informatik01
Xivilai
  • 1
    To insert a new line, place 2 spaces at the end of the line. More about markdown syntax: http://en.wikipedia.org/wiki/Markdown - See also:http://stackoverflow.com/editing-help – assylias Jul 31 '13 at 11:47

5 Answers

300

The issue you're having is with the type of quantifier. Your first group (index 1; index 0 represents the whole match) uses a greedy quantifier, which means it will match as much as it can. Since . matches any character, the group consumes as many characters as possible while still leaving enough for the following groups to match.

In short, your 1st group .* matches anything as long as the next group \\d+ can match something (in this case, the last digit).

As for the 3rd group, it will match anything after the last digit.
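To make the greedy behaviour concrete, here is a minimal sketch that prints all three groups for the original pattern. The class name GreedyGroupsDemo and the helper groups are illustrative names, not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyGroupsDemo {

    // Returns groups 1..3 of the first match, or null if nothing matches.
    // Assumes the regex defines at least three capturing groups.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        String[] g = groups("(.*)(\\d+)(.*)", line);
        // Brackets make leading/trailing spaces in each group visible.
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + (i + 1) + ": [" + g[i] + "]");
        }
    }
}
```

With the greedy first group this prints "This order was placed for QT300", then "0", then "! OK?": the first group gives back only the single character that \\d+ needs.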

If you change it to a reluctant quantifier in your 1st group, you'll get the result I suppose you are expecting, that is, the 3000 part.

Note the question mark in the 1st group.

String line = "This order was placed for QT3000! OK?";
Pattern pattern = Pattern.compile("(.*?)(\\d+)(.*)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println("group 1: " + matcher.group(1));
    System.out.println("group 2: " + matcher.group(2));
    System.out.println("group 3: " + matcher.group(3));
}

Output:

group 1: This order was placed for QT
group 2: 3000
group 3: ! OK?

More info on Java Pattern here.

Finally, the capturing groups are delimited by round brackets, and provide a very useful way to use back-references (amongst other things), once your Pattern is matched to the input.

In Java 6, groups can only be referenced by their index (beware of nested groups and the subtlety of ordering).

From Java 7 onward it's much easier, as you can use named groups.
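As a rough illustration of named groups (Java 7 or later), the sketch below gives each group a name and then uses those names both for extraction and in a replacement string. The group names prefix, number and rest, and the class NamedGroupsDemo, are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {

    // Same shape as the reluctant pattern above, with a name on each group.
    private static final Pattern ORDER =
            Pattern.compile("(?<prefix>.*?)(?<number>\\d+)(?<rest>.*)");

    // Returns the first run of digits, looked up by group name.
    static String number(String input) {
        Matcher m = ORDER.matcher(input);
        return m.find() ? m.group("number") : null;
    }

    // Uses the named groups as references in the replacement string.
    static String bracketNumber(String input) {
        return ORDER.matcher(input).replaceFirst("${prefix}[${number}]${rest}");
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        System.out.println(number(line));
        System.out.println(bracketNumber(line));
    }
}
```

Named groups still have an index, so m.group(2) and m.group("number") return the same text here; the name just saves you from counting parentheses.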

renadeen
Mena
  • Thanks! Is the reason group 2 stored 0 because the entire line was consumed by the greedy quantifier which then backed off until it came into contact with one or more numbers. 0 satisfied this so the expression succeeded. I find the third group confusing, does that greedy quantifier also consume the entire line, but backs off until it finds the one or more numbers (\\d+) which is supposed to precede it? – Xivilai Jul 31 '13 at 14:22
  • @Xivilai let me fine-tune my explanation in my answer, just a sec. – Mena Jul 31 '13 at 14:31
  • That's a good explanation. So the reluctant starts from the left and just takes the minimum whereas with the greedy, it will take as much as possible (starting from the right), only stopping before the last digit to satisfy that condition. The third group takes the rest. – Xivilai Jul 31 '13 at 14:46
  • @Xivilai more or less. It always starts from the left though in this case. [Here](http://docs.oracle.com/javase/tutorial/essential/regex/quant.html) is some more info about quantifiers. – Mena Jul 31 '13 at 15:06
  • 2
    You can use named capture groups in Java 5/6 with [`named-regexp`](http://tony19.github.com/named-regexp/index.html). –  Aug 02 '14 at 01:17
  • Why it's `while (matcher.find())` but not `if (matcher.find())`? – Weekend Apr 14 '21 at 06:20
  • 1
    @Weekend in this case, it's equivalent insofar as the loop will only run once with that string example. I think I used a `while` loop back then because it's a common usage for patterns that may repeat in the input, and a common mistake to use `if` instead of `while` in those cases. TL;DR `if` is safe to use when you're sure the pattern won't repeat / you need the first occurrence only / `while` covers every case. – Mena Apr 14 '21 at 06:44
20

This is totally OK.

  1. Group 0 (m.group(0)) always captures the whole region matched by your regular expression. In this case, it's the whole string.
  2. Quantifiers are greedy by default, meaning that the first group captures as much as possible without violating the regex. The (.*)(\\d+) (the first part of your regex) covers the ...QT300 in the first group and the 0 in the second.
  3. You can quickly fix this by making the first group non-greedy: change (.*) to (.*?).

For more info on greedy vs. lazy, check this site.

f1sh
6

Your understanding is correct. However, if we walk through:

  • (.*) will swallow the whole string;
  • it will need to give back characters so that (\\d+) is satisfied (which is why 0 is captured, and not 3000);
  • the last (.*) will then capture the rest.
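The walk-through above can be checked by printing where each group starts and ends. This is only a sketch; GroupSpansDemo and spans are hypothetical names, and it uses nothing beyond Matcher.start(int), Matcher.end(int) and Matcher.groupCount().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSpansDemo {

    // Returns {start, end} for group 0..groupCount of the first match,
    // or an empty array if there is no match. End indices are exclusive.
    // Assumes every group takes part in the match, as in this pattern;
    // start(i) returns -1 for a group that did not participate.
    static int[][] spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new int[0][];
        }
        int[][] result = new int[m.groupCount() + 1][];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = new int[] { m.start(i), m.end(i) };
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        int[][] s = spans("(.*)(\\d+)(.*)", line);
        for (int i = 0; i < s.length; i++) {
            System.out.println("group " + i + ": [" + s[i][0] + ", " + s[i][1] + ") -> "
                    + line.substring(s[i][0], s[i][1]));
        }
    }
}
```

The first group ends exactly where the last digit begins, which shows that it gave back characters only until (\\d+) could match a single digit.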

I am not sure what the original intent of the author was, however.

fge
5

From the doc :

Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().

So capture group 0 returns the entire match, which in this case is the whole line.
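A small sketch of that point: group 0 is the text the whole pattern matched, which is the entire line only when the pattern happens to cover it. The class and method names below are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroDemo {

    // Returns group 0 of the first match, or null if there is no match.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(0) : null;
    }

    // True when a match exists and group(0) and group() agree.
    static boolean groupZeroEqualsGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() && m.group(0).equals(m.group());
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        // The .* on both sides stretch the match over the whole line.
        System.out.println(wholeMatch("(.*)(\\d+)(.*)", line));
        // Without them, group 0 is just the digits.
        System.out.println(wholeMatch("\\d+", line));
        System.out.println(groupZeroEqualsGroup("\\d+", line));
    }
}
```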

Michael Laffargue
1

A small note for others: You can make a quantifier possessive by placing an extra + after it, or make it lazy/reluctant by adding an extra ? after it. By default, quantifiers are greedy.

So, the examples are:

*  is greedy 
*? is lazy  
*+ is possessive
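To compare the three modes on the string from the question, here is a hedged sketch that changes only the quantifier in the first group. QuantifierModesDemo and firstGroup are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModesDemo {

    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        // Greedy: takes everything, then backs off one character at a time.
        System.out.println("greedy:     " + firstGroup("(.*)(\\d+)(.*)", line));
        // Lazy: takes as little as possible, growing only when needed.
        System.out.println("lazy:       " + firstGroup("(.*?)(\\d+)(.*)", line));
        // Possessive: takes everything and never gives anything back,
        // so \\d+ has nothing left to match and the result is null.
        System.out.println("possessive: " + firstGroup("(.*+)(\\d+)(.*)", line));
    }
}
```

The possessive variant fails outright here, which is why it is mostly useful when you know the following part of the pattern cannot need any of the consumed characters.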
Mikhail2048