Why do I get the first capture group only?

Question

(https://stackoverflow.com/a/2304626/6607497 and https://stackoverflow.com/a/37004214/6607497 did not help me)

Analyzing a problem with /proc/stat in Linux I started to write a small utility, but I can't get the capture groups the way I wanted. Here is the code:

#!/usr/bin/perl
use strict;
use warnings;

if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}

For example with these input lines I get the output:

> cat /proc/stat
cpu  2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106  ...

So the match actually works, but I don't get the capture groups into @vals (perls 5.18.2 and 5.26.1).

To summarize all solutions (my own excepted) so far: It seems you cannot do it with one regex only; instead you have to use a two-step process (like match, then split). — U. Windl, Jul 13 '20 at 06:33

zdim · Answer 1 · 2022-03-09T23:45:09.370

6

Only the last of the repeated matches from a single pattern is captured.
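
A minimal sketch of that behaviour, using one sample line from the question (the variable names here are only illustrative). The pattern contains just two capture groups, so a successful match returns two values no matter how often the inner group repeats:

```perl
use strict;
use warnings;

# Sample line taken from the question's /proc/stat excerpt.
my $line = "cpu1 775182 3866 147044 38910 135 0 15026 0 0 0\n";

# (?:\s+(\d+))+ contains a single capture group, so every repetition
# overwrites the value captured by the previous one.
my @captures = $line =~ /^cpu(\d*)(?:\s+(\d+))+$/;

print scalar(@captures), " captures: @captures\n";   # 2 captures: 1 0
```

With only one element landing in @vals, the question's code prints 0 for $#vals on every cpu line.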

Instead, you can just split the line and then check, and adjust, the first field:

while (<$fh>) {
    my ($cpu, @vals) = split;
    next if not $cpu =~ s/^cpu//;
    print "$cpu $#vals\n";
}

If the first element of the split's return doesn't start with cpu, the regex substitution fails and so the line is skipped. Otherwise, you get the number following cpu (or an empty string), as in OP.^†

Or you can use the particular structure of the line you process:

while (<$fh>) {
    if (my ($cpu, @vals) = map { length($_) ? split(' ', $_) : '' } /^cpu([0-9]*) \s+ (.*)/x) {
        print "$cpu $#vals\n";
    }
}

The regex returns two items and each goes through the map. The first one ends up in $cpu: a number is passed through as is (splitting it yields just itself), while an empty string has to be handed over explicitly, since splitting an empty string yields no fields at all. The other item is split into the numbers.

Both these produce the needed output in my tests.

^† Since we always check for ^cpu (and remove it) it makes sense to do that first, and only then split -- when needed. However, that gets a little tricky for the following reason.

A bare split discards leading (and trailing) whitespace by default, so for a line where the cpu string has no trailing digits (cpu  2709779...) we would end up with the first value in place of the cpu designation! A quiet error.

Thus we need to tell split to use /\s+/ explicitly, as it then produces an empty leading field where the leading whitespace was

while (<$fh>) {
    next if not s/^cpu//;
    my ($cpu, @vals) = split /\s+/;  # now $cpu may be an empty string
    print "$cpu $#vals\n";
}

This now works as intended: for the cpu line without trailing digits $cpu gets an empty string, a case to handle but a clear one. But this is misleading, and an unaware maintainer -- or us, the proverbial six months later -- may be tempted to remove the seemingly "unneeded" /\s+/, introducing an error.
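
A small sketch of the difference between the two split forms, run on the remainder of the aggregate cpu line after the leading "cpu" has been removed (the value list is shortened and the variable names are only illustrative):

```perl
use strict;
use warnings;

# What is left of the "cpu  2709779 ..." line once "cpu" is stripped.
my $rest = "  2709779 13999 551920\n";

my @default = split ' ', $rest;    # what a bare split does on $_
my @pattern = split /\s+/, $rest;  # leading whitespace yields an empty first field

print "default: ", scalar(@default), " fields, first is '$default[0]'\n";
# default: 3 fields, first is '2709779'
print "pattern: ", scalar(@pattern), " fields, first is '$pattern[0]'\n";
# pattern: 4 fields, first is ''
```

In both cases the trailing newline produces no extra field, because split drops trailing empty fields.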

edited Mar 09 '22 at 23:45

answered Jul 02 '20 at 07:58

zdim


Interesting variant, but harder to understand than my original code IMHO. What's the use of `/x`, and why don't you use `/^cpu([0-9]*) (.*)$/`? – U. Windl Jul 02 '20 at 08:13

@U.Windl The "_original code_" ... doesn't actually do what you need? This does. The `/x` merely allows literal spaces inside (and comments, and newlines), for readability. It's not needed. I dropped `$` since it isn't needed, the `.*` matches to the end anyway. – zdim Jul 02 '20 at 08:15
@U.Windl "_harder to understand than my original code IMHO_" --- yes, absolutely agree, that second option is a little tricky. I like it as a curious way to do this. I'd recommend the first one. – zdim Jul 02 '20 at 08:18
In the first code sample, you are always testing that `$cpu` is the expected value. Since you are going to test that for every line, you could do that first then only split if that succeeds. – brian d foy Jul 02 '20 at 17:37
Assuming that the matching at the beginning is more efficient than the split, you could `next unless /^cpu/;` first, and then do the `split`. – U. Windl Jul 03 '20 at 05:51
@briandfoy By all means yes, and I considered that. Since I'm anyway parsing the line to check for `cpu` (and to capture the number) I decided to then split it once I'm at it. I don't claim it's better, I sort of bounced between the two. While checking first is a standard and simpler approach (and thus clearer), with `split` first I process the actual line once, and the purpose is clear enough (it does feel yucky to create `$cpu` and then perhaps find out that it ain't, also along with a whole array). It also depends on the typical expected use. – zdim Jul 03 '20 at 06:36
@U.Windl Efficiency considerations here can be tricky. There are optimizations that may depend on tiny details, for both. (Also, can't do just `/^cpu/` -- need to capture the following number as well, at least.) Then ... what about the rest of processing? How often does it go? That's critical since with checking first we then need to run the regex engine _again_ (or run `split` anyway), to get and store values. How does it all add up? Generally though, they're close enough I'd say. But, above all, this isn't of great concern here; just how many times does that file get read? – zdim Jul 03 '20 at 06:48

pii_ke · Answer 2 · 2020-07-02T08:22:23.523

2

Going by the example input, the following content inside the while loop should work.

if (/^cpu(\d*)/) {
    my $cpu = $1;
    my (@vals) = /(?:\s+(\d+))/g;  # no trailing +, so each number is a separate match
    print "$cpu $#vals\n";
}

edited Jul 02 '20 at 08:22

answered Jul 02 '20 at 07:37

pii_ke


That's basically what @Tim Biegeleisen said in his comment on https://stackoverflow.com/a/62690982/6607497. – U. Windl Jul 02 '20 at 08:08

brian d foy · Answer 3 · 2020-07-03T01:28:15.880

In an exercise for Learning Perl, we state a problem that's easy to solve with two simple regexes but hard with one (but then in Mastering Perl I pull out the big guns). We don't tell people this because we want to highlight the natural behavior to try to write everything in a single regex. Some of the contortions in other answers remind me of that, and I wouldn't want to maintain any of them.

First, there's the issue of only processing the interesting lines. Then, once we have that line, grab all the numbers. Translating that problem statement into code is very simple and straightforward. No acrobatics here because assertions and anchors do most of the work:

use v5.10;

while( <DATA> ) {
    next unless /\A cpu(\d*) \s /ax;
    my $cpu = $1;
    my @values = / \b (\d+) \b /agx;
    say "$cpu " . @values;
    }

__END__
cpu  2709779 13999 551920 11622773 135610 0 194680 0 0 0
cpu0 677679 3082 124900 11507188 134042 0 164081 0 0 0
cpu1 775182 3866 147044 38910 135 0 15026 0 0 0
cpu2 704411 3024 143057 37674 1272 0 8403 0 0 0
cpu3 552506 4025 136918 38999 160 0 7169 0 0 0
intr 176332106  ...

Note that the OP still has to decide how to handle the cpu case with no trailing digits. I don't know what you want to do with the empty string.

score 1 · Answer 4 · answered Jul 02 '20 at 07:18

1

Perl's regex engine only remembers the last value captured by a repeated group. If you want to capture each number in a separate capture group, then one option would be to use an explicit regex pattern:

if (open(my $fh, '<', my $file = '/proc/stat')) {
    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$/) {
            print "$cpu $#vals\n";
        }
    }
    close($fh);
} else {
    die "$file: $!\n";
}

answered Jul 02 '20 at 07:18

Tim Biegeleisen


The problem is that the number of values for the `cpu`s has varied over time, and maybe more values will be added. I feel this is a deficiency in Perl, as the whole line is matched. – U. Windl Jul 02 '20 at 07:36
How about this: Use your current regex to assert that each line matches, then use a string split to isolate each number term as a separate element in an array? – Tim Biegeleisen Jul 02 '20 at 07:37

score 0 · Accepted Answer · answered Jul 02 '20 at 07:42

0

Replacing

    while (<$fh>) {
        if (my ($cpu, @vals) = /^cpu(\d*)(?:\s+(\d+))+$/) {

with

    while (<$fh>) {
        my @vals;
        if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+)(?{ push(@vals, $^N) }))+$/) {

does what I wanted (requires perl 5.8 or newer).
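
A self-contained sketch of the same idea, wrapped in a helper so it can be tried without /proc/stat. The sub name parse_cpu_line is hypothetical, and the sketch assumes Perl 5.18 or newer so that the lexical @vals inside the sub is seen correctly by the embedded code block:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): returns the CPU id followed
# by all values of a "cpu" line, or an empty list if the line does not match.
sub parse_cpu_line {
    my ($line) = @_;
    my @vals;
    # (?{ ... }) runs whenever the engine passes that point; $^N holds the
    # text of the most recently closed group, here the inner (\d+).
    # Note: pushes are not undone if the overall match fails later, which is
    # why @vals is only returned after a successful match.
    if (my ($cpu) = $line =~ /^cpu(\d*)(?:\s+(\d+)(?{ push @vals, $^N }))+$/) {
        return ($cpu, @vals);
    }
    return;
}

my ($cpu, @vals) = parse_cpu_line("cpu3 552506 4025 136918\n");
print "$cpu @vals\n";   # 3 552506 4025 136918
```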

answered Jul 02 '20 at 07:42

U. Windl


That's a pretty cruel thing to do to the next programmer who has to look at that. You're reaching for a 30-ton excavator when a hand spade will get the job done. – brian d foy Jul 02 '20 at 17:57
I kind of disagree: I'm matching lines starting with `cpu\d*`, and then add all numbers following on a list (push onto array). Of course you'll have to understand what the syntax does. Admittedly I did not examine the performance of the regex. – U. Windl Jul 13 '20 at 08:15

score 0 · Answer 6 · answered Jul 04 '20 at 17:16

0

Here's my example. I thought I'd add it because I like simple code. It also allows "cpu7" with no trailing digits.

#!/usr/bin/perl
use strict;
use warnings;

my $file = "/proc/stat";
open(my $fh, "<", $file) or die "$file: $!\n";
while (<$fh>) 
{
  if ( /^cpu(\d+)(\s+)?(.*)$/ ) 
  {
    my $cpu = $1; 
    my $vals = scalar split( /\s+/, $3 ) ;
    print "$cpu $vals\n";
  }
}
close($fh);

answered Jul 04 '20 at 17:16

hoffmeister


The original code tried to collect the numbers after `cpu#` as an array; your code simply adds it as a scalar. – U. Windl Jul 13 '20 at 06:27

score -1 · Answer 7 · edited Jul 02 '20 at 18:26

-1

Just adding to Tim's answer:

You can capture multiple values with one group (using the g-modifier), but then you have to split the statement.

    if (my ($cpu) = /^cpu(\d*)(?:\s+(\d+))+$/) {
        my @vals= /(?:\s+(\d+))/g;
        print "$cpu $#vals\n";
    }

edited Jul 02 '20 at 18:26

brian d foy


answered Jul 02 '20 at 07:38

Georg Mavridis


That's basically what @Tim Biegeleisen said in his comment on https://stackoverflow.com/a/62690982/6607497. – U. Windl Jul 02 '20 at 08:10
He had a fixed number of capture groups. Your solution is the efficient (but a bit complicated) one without a fixed number of capture groups. – Georg Mavridis Jul 02 '20 at 08:32
