Regex Grouping with Repeat Operators

Question

I am using groups to try to match a certain pattern, and I am not getting quite the results I expect. The pattern of interest is as follows:

([0-9]+(\.[0-9]+)+)

For string 1.23, I get $1=1.23, and $2=.23 which makes sense to me.

But for string 1.2.3, I get $1=1.2.3 and $2=.3, where I would expect $2=.2.3, because its group is a decimal point and a digit, repeated.

Can someone please explain to me how this works? Thank you!

You're close. To get what you are after in `$2`, you need another set of parentheses. See my answer below. — DavidRR, Dec 03 '13 at 16:16

score 4 · Answer 1 · answered Dec 03 '13 at 14:41

When you use capturing groups with a quantifier, only the last repetition of the captured pattern will be stored.
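
As a minimal sketch of this behaviour, the snippet below runs the pattern from the question against both sample strings. The variable names (`$re`, `%captures`) are only illustrative.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged.
my $re = qr/([0-9]+(\.[0-9]+)+)/;

my %captures;
for my $str ('1.23', '1.2.3') {
    if ($str =~ $re) {
        # $1 is the whole match; $2 keeps only the final
        # repetition of the quantified inner group.
        $captures{$str} = [$1, $2];
        print "$str: \$1=$1 \$2=$2\n";
    }
}
# prints:
# 1.23: $1=1.23 $2=.23
# 1.2.3: $1=1.2.3 $2=.3
```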

answered Dec 03 '13 at 14:41

Hunter McMillen

score 3 · Accepted Answer · answered Dec 03 '13 at 14:41

"These pattern match variables are scalars and, as such, will only hold a single value. That value is whatever the capturing parentheses matched last."

http://blogs.perl.org/users/sirhc/2012/05/repeated-capturing-and-parsing.html

In your example, $1 captures 1.2.3. As the inner group repeats, $2 is first set to .2 and is then overwritten by the final repetition, .3.
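
If every repetition is needed rather than only the last one, one possible approach (a sketch, not the only way) is to capture the whole repeated tail first and then split it with a second match using /g in list context. The names `$tail` and `@parts` are made up for this illustration.

```perl
use strict;
use warnings;

my $str = '1.2.3';
my @parts;

# The inner group is non-capturing; the outer group captures
# the entire repeated tail in one piece.
if ($str =~ /([0-9]+)((?:\.[0-9]+)+)/) {
    my $tail = $2;    # ".2.3"

    # In list context, a /g match returns every capture it finds,
    # so each repetition ends up as its own list element.
    @parts = $tail =~ /(\.[0-9]+)/g;
}

print join(', ', @parts), "\n";    # prints ".2, .3"
```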

score 3 · Answer 3 · edited May 23 '17 at 11:49

Perhaps this regex will meet your needs:

\b(\d+)((?:\.\d+)+)\b

This regex separates the leading integer sequence from its repeating fractional components.

(As indicated by @ysth, please keep in mind that \d may match more characters than you intend. If that is the case, use the character class [0-9] instead or use the /a modifier.)
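
To illustrate the difference, here is a small sketch that tests a single non-ASCII decimal digit (U+0663, ARABIC-INDIC DIGIT THREE, chosen only as an example) against each form. It assumes Perl 5.14 or later, since that is when the /a modifier became available.

```perl
use v5.14;    # /a requires Perl 5.14 or later
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE: a Unicode decimal digit outside ASCII.
my $ch = "\x{0663}";

my $unicode_d = $ch =~ /^\d$/    ? 1 : 0;    # \d with Unicode semantics
my $ascii_set = $ch =~ /^[0-9]$/ ? 1 : 0;    # explicit ASCII character class
my $ascii_d   = $ch =~ /^\d$/a   ? 1 : 0;    # \d restricted to ASCII by /a

print "\\d: $unicode_d, [0-9]: $ascii_set, \\d with /a: $ascii_d\n";
# prints: \d: 1, [0-9]: 0, \d with /a: 0
```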

Here's a Perl program that demonstrates this regex on a sample data set. (Also see the live demo.)

#!/usr/bin/perl -w

use strict;
use warnings;

while (<DATA>) {
    chomp;

    # A - A sequence of digits
    # B - A period and a sequence of digits
    # C - Repeat 'B'.

    if (/\b(\d+)((?:\.\d+)+)\b/) {
#           ^^^     ^^^^^
#            A        B
#                   ^^^^^^^
#                      C

        print "[$1]  [$2]\n";
    }
}

__END__
1.23
123.456
1.2.3
1.22.333.444

Expected Output:

[1]  [.23]
[123]  [.456]
[1]  [.2.3]
[1]  [.22.333.444]

answered Dec 03 '13 at 16:10

DavidRR

changing `[0-9]` to `\d` matches a whole lot more characters (unless you also use the /a flag) – ysth Dec 03 '13 at 16:13
@ysth - [Does “\d” in regex mean a digit?](http://stackoverflow.com/a/6479605/1497596). Does that answer fully define what you mean by "a whole lot more characters"? – DavidRR Dec 03 '13 at 16:31
From the [PerlRE doc](http://perldoc.perl.org/perlre.html): `/d`, `/u`, `/a`, and `/l`, available starting in **5.14**, are called the **character set modifiers**; they affect the character set semantics used for the regular expression. ... The `/a` modifier, on the other hand, may be useful. Its purpose is to allow code that is to work mostly on **ASCII** data to not have to concern itself with Unicode. – DavidRR Dec 03 '13 at 16:37
