23

I've been looking up regular expression tutorials trying to get the hang of them and was enjoying the tutorial in this link right up until this problem: http://regexone.com/lesson/12

I cannot seem to figure out what the difference between "matching" and "capturing" is. Nothing I write seems to select the text under the "Capture" section (not even .*).

Edit: Here is an example for the tutorial that confuses me: (.* (.*)) is considered correct and (.* .*) is not. Is this a problem with the tutorial or something I am not understanding?

asimes
  • Matching is a yes/no question: it matches or it doesn't. Capturing returns part of the match (the part that is in parens). You may capture several parts (which may be nested) and get back a list. – vish Jan 18 '14 at 05:43
  • I'm not sure what returning means in terms of a regex expression, I just understand matching – asimes Jan 18 '14 at 05:45
  • With "capturing" you tell the engine which parts of the match should be stored in some kind of register, so that you can use the value in the expression itself, or in some replacement value, depending on the function you are using the expression with. For example, `([a-z])\1` would match any repeating letter. The `(...)` indicate that you want to capture the value of this partial match, and `\1` lets you access the first (and only in this case) captured value. Or in other words: `\1` matches whatever the first capturing group (`(...)`) matched. – Felix Kling Jan 18 '14 at 05:52
  • What programming language are you most familiar with? A clear example could be easily formed using your familiar programming language. – Bryan Elliott Jan 18 '14 at 05:55
  • I'm familiar with a few languages but I guess the only time I have seen regex in a programming language is in deprecated PHP – asimes Jan 18 '14 at 05:58
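To make the two patterns from the question concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself uses PHP and Perl). The sample string "Jan 1987" is an assumption borrowed from the answers below. Both patterns match the same text, but only the nested one captures the year as a separate group, which is presumably why the tutorial accepts one and rejects the other.

```python
import re

text = "Jan 1987"  # sample input, assumed for illustration

# Both patterns match the whole string ...
plain = re.fullmatch(r"(.* .*)", text)
nested = re.fullmatch(r"(.* (.*))", text)

# ... but they capture different things.
plain_groups = plain.groups()    # one group: the whole string
nested_groups = nested.groups()  # two groups: the whole string, then the part after the space

print(plain_groups)
print(nested_groups)
```

The outer parentheses are group 1 and the inner ones are group 2, counted by the position of the opening parenthesis.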

4 Answers

24

Matching:

The engine finds a part of the string (or the whole string) that fits the pattern, but does not hand any sub-part back to you separately.

Capturing:

The engine finds a part of the string (or the whole string) that fits the pattern, and also hands back the sub-parts you marked with parentheses.

-- What's the meaning of returning?

When you need to check, store, validate, or otherwise work with a part of the string that your regex matched, you need capturing groups (...)

In your example, the regex .*?\d+ just matches each month-and-year string, without capturing anything. See here

And the regex .*?(\d+) matches the whole string and captures the year. See here

And (.*?(\d+)) matches the whole string and captures both the whole string and the year, in that order. See here

*Please notice the bottom right box titled Match groups

So returning....

1:

preg_match("/.*?\d+/", "Jan 1987", $match);
print_r($match);

Output:

Array
(
    [0] => Jan 1987
)

2:

preg_match("/(.*?\d+)/", "Jan 1987", $match);
print_r($match);

Output:

Array
(
    [0] => Jan 1987
    [1] => Jan 1987
)

3:

preg_match("/(.*?(\d+))/", "Jan 1987", $match);
print_r($match);

Output:

Array
(
    [0] => Jan 1987
    [1] => Jan 1987
    [2] => 1987
)

As the last example shows, there are 2 capturing groups, stored at indexes 1 and 2 of the array. Index 0 always holds the whole matched string, even though no group captured it.
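For readers who do not use PHP, the same three cases can be sketched in Python. The helper `match_array` is hypothetical and exists only to mimic the layout of the `$match` array above: index 0 is the whole match, followed by one entry per capturing group.

```python
import re

subject = "Jan 1987"

def match_array(pattern, text):
    """Return [whole match, group 1, group 2, ...], laid out like the PHP array above."""
    m = re.search(pattern, text)
    return [m.group(0), *m.groups()] if m else []

no_groups = match_array(r".*?\d+", subject)       # match only, nothing captured
one_group = match_array(r"(.*?\d+)", subject)     # the whole match is also group 1
two_groups = match_array(r"(.*?(\d+))", subject)  # group 1 is the whole match, group 2 the year

for result in (no_groups, one_group, two_groups):
    print(result)
```

In Python, `m.group(0)` plays the role of index 0 and `m.groups()` returns only the captured parts.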

revo
  • Why is `?` used? Is it to make the string prior to digits optional, to allow the digit capture group to capture all digits, not just the last one? This is my conclusion, after using regex101.com to compare between `(.*?(\d+))` and `(.*(\d+))`. I find it interesting that for the first case: `/.*?\d+/` (only the entire string is to be captured), that question mark does not do anything but is included. – Ben Butterworth Oct 09 '20 at 18:24
  • @BenButterworth you may find this helpful https://stackoverflow.com/q/2301285/1020526 – revo Oct 09 '20 at 22:08
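The effect of the `?` asked about in the comment above can be checked with a small sketch, again in Python as an illustrative assumption. With the lazy `.*?`, the prefix takes as little as possible, so the inner group gets every digit. With the greedy `.*`, the prefix takes as much as possible and leaves only the single digit that `\d+` requires.

```python
import re

subject = "Jan 1987"

lazy = re.search(r"(.*?(\d+))", subject)   # .*? gives up characters as early as possible
greedy = re.search(r"(.*(\d+))", subject)  # .* keeps as many characters as it can

lazy_year = lazy.group(2)
greedy_year = greedy.group(2)

print(lazy.group(1), "|", lazy_year)
print(greedy.group(1), "|", greedy_year)
```

The overall match (group 1) is identical in both cases; only the split between the prefix and the inner group changes, which is why the `?` looks like it does nothing in the pattern without groups.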
6

capturing in regexps means indicating that you're interested not only in matching (which is finding strings of characters that match your regular expression), but you're also interested in using specific parts of the matched string later on.

for example, the answer to the tutorial you linked to would be (\w{3}\s+(\d+)).

now, why ?

to simply match the date strings it would be enough to write \w{3}\s+\d+ (3 word characters, followed by one or more whitespace characters, followed by one or more digits). adding capture groups to the expression (a capture group is anything enclosed in parentheses ()) lets me later extract either the whole expression (using "$1", because the outermost pair of parentheses is the 1st the parser encounters) or just the year (using "$2", because the pair around \d+ is the 2nd the parser encounters).

capture groups come in handy when you're interested not only in matching strings against a pattern, but also in extracting data from the matched strings or modifying them. for example, suppose you wanted to add 5 years to each of those dates in the tutorial: being able to extract just the year from a matched string (using $2) would come in handy then.
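The "add 5 years" idea can be sketched as follows. This is a rough illustration in Python rather than a definitive implementation; the function name `add_years` and the sample input are assumptions. It reuses the pattern from this answer and reads the year through group 2.

```python
import re

DATE = re.compile(r"(\w{3}\s+(\d+))")

def add_years(text, years=5):
    """Return text with every 'Mon YYYY' date moved forward by `years`."""
    def bump(m):
        # Keep everything in the match before group 2, then append the new year.
        prefix = m.group(0)[: m.start(2) - m.start(0)]
        return prefix + str(int(m.group(2)) + years)
    return DATE.sub(bump, text)

shifted = add_years("Jan 1987, May 1969")
print(shifted)
```

Passing a function to `re.sub` is needed here because the replacement involves arithmetic on the captured value, which a plain replacement string cannot do.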

radai
  • Ok, thank you. I am assuming that the `$1`, `$2`, etc. is specific to PHP. In that case, can you point me to a reference of the PHP function I would use to produce these variables? Do you really have to access them as individual variables as opposed to an array? – asimes Jan 20 '14 at 06:08
  • the concept of capturing is common to all regexp implementations. specifically in php it seems the way to access capture groups is by accessing the `matches` array by the capture index. see the documentation for preg_match here - http://php.net/manual/en/function.preg-match.php – radai Jan 20 '14 at 06:13
4

In a nutshell, a "Capture" saves the collected value in a special place so you can access it later.

As some have pointed out, the captured stuff can be used 'later on' in the same pattern, so that

/(ab*c):\1/

will match ac:ac, or abc:abc, or abbc:abbc etc. The (ab*c) will match an a, any number of b, then a c. Whatever it DOES match is 'captured'. In many programming and scripting languages, the syntax like \1, \2 etc has the special meaning referring to the first, second, etc captures. Since the first one might be abbc, then the \1 bit has to match abbc only, thus the only possible full match would then be 'abbc:abbc'
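A short sketch of this backreference behaviour, written in Python as an illustrative assumption (its `re` module also supports `\1` inside a pattern). The candidate strings are made up for the demonstration.

```python
import re

pattern = re.compile(r"(ab*c):\1")

candidates = ["ac:ac", "abc:abc", "abbc:abbc", "abbc:abc", "abc:ac"]

# \1 must repeat exactly what group 1 captured, not merely something of the same shape.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for s, ok in results.items():
    print(s, "->", ok)
```

The last two candidates fail even though both halves individually fit `ab*c`, because the second half differs from what was captured.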

Perl allows the \1 \2 syntax inside a pattern and exposes the captures afterwards as the variables $1, $2, etc. PHP also accepts \1 inside a pattern, but it does not create $1 variables: there, $1 and $2 appear only inside the replacement string of preg_replace, and preg_match hands the captures back through its matches array. Many languages have picked up the powerful regex engine from Perl, so this is increasingly common.

In Perl, the typical use of $1 is:

/(ab*c)(de*f)/

then later (e.g. the next line of code)

$x = $1 . $2;   # Perl concatenation; in PHP this would be $m[1] . $m[2] after preg_match($pattern, $subject, $m)

So the capture is available until your next use of a regex. Depending on the programming language in use, those captured values may be smashed by the next pattern match, or they may be permanently available through special syntax or use of the language.

0

take a look at these 2 regex - from your example

# first
/(... (\d\d\d\d))/
#second
/... \d\d\d\d/

they both match "Jun 1965" and "May 2000"
(and incidentally many other things like "555 1234")

the second one just matches: the answer is yes or no

so you could say

if ($x=~/... \d\d\d\d/){do something}

the first one captures so

if ($x=~/(... (\d\d\d\d))/){print $1,";;;",$2}

would print "Jun 1965;;;1965" for $x = "Jun 1965"
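The same contrast, sketched in Python as an illustrative assumption (the answer above is in Perl). The first check only asks whether the pattern matches; the second reads the two captured groups.

```python
import re

text = "Jun 1965"

# Matching only: a yes/no answer.
is_match = re.search(r"... \d\d\d\d", text) is not None

# Capturing: the same shape, but the parenthesized parts can be read back.
m = re.search(r"(... (\d\d\d\d))", text)
joined = m.group(1) + ";;;" + m.group(2)

# As noted above, the pattern is loose enough to accept other strings too.
also_matches = re.search(r"... \d\d\d\d", "555 1234") is not None

print(is_match, joined, also_matches)
```

Here `m.group(1)` and `m.group(2)` correspond to Perl's `$1` and `$2`.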

vish
  • I just realized your `...` was not an ellipsis looking at it again so now it makes more sense. What about `$1` and `$2`? – asimes Jan 19 '14 at 16:59
  • @asimes `$1` and `$2` are known as "backreferences". They are special characters that hold any data matched inside a capture group. The basic syntax is a dollar-sign `$`, followed by a number equal to the order of the capture groups. In the case above, `$1` holds `... \d\d\d\d` because it aligns to the outermost pair of brackets, while `$2` holds just `\d\d\d\d` as it is aligned to the second pair of brackets. – Taylor Hx Jan 20 '14 at 06:00
  • The captured strings are stored in `$1`, `$2`, `$3` etc, as many as there are parenthesized expressions, counting the opening parentheses from the left. – tripleee Jan 20 '14 at 06:01
  • Proper "back references" come into play when the regex itself refers back to a match. For example, `(.).\1` will match "aba" and "bbb" but not "abb" or "abc"; the `\1` refers back to the first parenthesized expression, which is here the first character; so the third character has to be identical to the first. The group will also be captured in `$1`. – tripleee Jan 20 '14 at 06:08