Parsing a file by summing up different columns of each row separated by blank line

Question

I have an input file like the one below:

#

volume stats
start_time  1
length      2
--------
ID
0x00a,1,2,3,4
0x00b,11,12,13,14
0x00c,21,22,23,24

volume stats
start_time  2
length      2
--------
ID
0x00a,31,32,33,34
0x00b,41,42,43,44
0x00c,51,52,53,54

volume stats
start_time  3
length      2
--------
ID
0x00a,61,62,63,64
0x00b,71,72,73,74
0x00c,81,82,83,84

#

I need the output in the following format:

1 33    36  39  42
2 123   126 129 132
3 213   216 219 222

#

Below is my code:

#!/usr/bin/perl
use strict;
use warnings;
#use File::Find;

# Define file names and its location
my $input = $ARGV[0];

# Grab the vols stats for different intervals
open (INFILE,"$input") or die "Could not open sample.txt: $!";
my $date_time;
my $length;
my $col_1;
my $col_2;
my $col_3;
my $col_4;
foreach my $line (<INFILE>)
{

    if ($line =~ m/start/)
        {
            my @date_fields = split(/   /,$line);
            $date_time = $date_fields[1];
        }
    if ($line =~ m/length/i)
        {
            my @length_fields = split(/ /,$line);
            $length = $length_fields[1];
        }
    if ($line =~ m/0[xX][0-9a-fA-F]+/)
        {
            my @volume_fields = split(/,/,$line);
            $col_1 += $volume_fields[1];
            $col_2 += $volume_fields[2];
            $col_3 += $volume_fields[3];
            $col_4 += $volume_fields[4];
            #print "$col_1\n";
        }
    if ($line =~ /^$/)
        {
            print "$date_time $col_1 $col_2 $col_3 $col_4\n";
                $col_1=0;$col_2=0;$col_3=0;$col_4=0;
        }
}
close (INFILE);

#

My code's output is:

1
 33 36 39 42
2
 123 126 129 132

#

Basically, for each time interval, the script should sum each column across all the rows in that block and print the column totals next to the interval's start time.

[When you find yourself adding an integer suffix to variable names, think ***I should have used an array***.](https://stackoverflow.com/a/1829927/100754) — Sinan Ünür, May 14 '16 at 11:46

Sobrique · Accepted Answer · 2016-05-13T16:39:07.517

$/ is your friend here. Try setting it to '' to enable paragraph mode (separating your data by blank lines).

#!/usr/bin/env perl

use strict;
use warnings;

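# Paragraph mode: each read returns one blank-line-separated block.
local $/ = '';

while ( <> ) {
    my ( $start ) = m/start_time\s+(\d+)/;
    my ( $length ) = m/length\s+(\d+)/;    # captured but not used below
    my @row_sum;
    # Each match is one data row such as "0x00a,1,2,3,4".
    for ( m/(0x.*)/g )  {
        my ( $key, @values ) = split /,/;
        for my $index ( 0..$#values ) {
           $row_sum[$index] += $values[$index];
        }
    }
    print join ( "\t", $start, @row_sum ), "\n";
}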

Output:

1       33      36      39      42
2       123     126     129     132
3       213     216     219     222

NB: the output is tab-separated. You can use printf or sprintf if you need more control over the formatting.
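As a rough illustration of the sprintf option, the sketch below pads each value to a fixed width instead of using tabs. The helper name format_row and the field widths (4 and 6) are assumptions chosen for this example, not part of the answer above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: left-align the start time in 4 characters and
# right-align every column sum in 6 characters.
sub format_row {
    my ( $start, @sums ) = @_;
    # '%6d' x @sums repeats the field spec once per column sum.
    return sprintf( '%-4s' . ( '%6d' x @sums ), $start, @sums );
}

print format_row( 1, 33,  36,  39,  42 ),  "\n";
print format_row( 2, 123, 126, 129, 132 ), "\n";
```

Because the format string is built from the number of sums, the same helper should work if the input ever gains or loses a column.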

I would also suggest that instead of:

my $input = $ARGV[0]; 
open (my $input_fh, '<', $input) or die "Could not open $input: $!";

You would be better off with:

while ( <> ) {

Because <> is the magic filehandle in Perl: it opens the files named on the command line and reads them one at a time, and if there are none, it reads STDIN. This is just how grep/sed/awk do it.

So you can still run this in any of these ways:

scriptname.pl sample.txt
curl http://somewebserver/sample.txt | scriptname.pl
scriptname.pl sample.txt anothersample.txt moresample.txt
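To see the multi-file behaviour without touching the command line, the sketch below writes two temporary files and points @ARGV at them before reading with <>. The helper names write_temp and start_times and the shortened sample data are made up for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write text to a temporary file and return its path.
sub write_temp {
    my ($text) = @_;
    my ( $fh, $path ) = tempfile( UNLINK => 1 );
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $path;
}

# Read every file named in the argument list through <>, one paragraph
# at a time, and collect the start_time of each record.
# Note: with an empty argument list, <> would fall back to STDIN.
sub start_times {
    local @ARGV = @_;
    local $/    = '';
    my @starts;
    while ( my $record = <> ) {
        push @starts, $1 if $record =~ /start_time\s+(\d+)/;
    }
    return @starts;
}

my $first  = write_temp("start_time  1\n0x00a,1,2\n\nstart_time  2\n0x00a,3,4\n");
my $second = write_temp("start_time  3\n0x00a,5,6\n");
print join( ' ', start_times( $first, $second ) ), "\n";
```

The loop body never needs to know how many files were given, which is the main convenience of <> over an explicit open.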

Also, if you want to open the file yourself, you're better off using lexical variables and three-argument open:

open ( my $input_fh, '<', $ARGV[0] ) or die $!;

And you really shouldn't be using 'numbered' variables like $col_1. If there are numbers in the names, an array is almost always better.

Thanks Sobrique, it did serve my need. I wanted to ask one more query, If I have another file having entry of hex numbers in different line as below 0x00a 0x00b Then, how do I add only those rows columns. I added if condition as if ($_ eq /^0x00a/), but it did not work. The final output would look like. The third line won't come as it starts with '0x00c'. 1 33 36 39 42 2 123 126 129 132 — Buddy, May 28 '16 at 17:36
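One possible approach to the follow-up in the comment above is sketched here. The expression `$_ eq /^0x00a/` compares the string against the result of the match rather than testing the pattern; a pattern test needs `=~`. For a list of IDs, a lookup hash is usually simpler than a chain of regex tests. The names sum_wanted and %wanted are hypothetical, and the wanted IDs are hard-coded here where a real script would read them from the second file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sum the columns of one record (paragraph), counting only rows whose ID
# is a key of the lookup hash passed by reference.
sub sum_wanted {
    my ( $record, $wanted ) = @_;
    my @sums;
    for my $row ( $record =~ /^(0x[0-9a-fA-F]+,.*)$/mg ) {
        my ( $id, @values ) = split /,/, $row;
        next unless $wanted->{$id};    # skip IDs that are not in the list
        $sums[$_] += $values[$_] for 0 .. $#values;
    }
    return @sums;
}

# Assumption: in practice these IDs would be read from the other file,
# one per line, with chomp applied to each.
my %wanted = map { $_ => 1 } qw(0x00a 0x00b);

my $record = "start_time  1\n0x00a,1,2,3,4\n0x00b,11,12,13,14\n0x00c,21,22,23,24\n";
print join( "\t", sum_wanted( $record, \%wanted ) ), "\n";
```

With only 0x00a and 0x00b selected, the first block sums to 12, 14, 16 and 18, since the 0x00c row is skipped.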

Sinan Ünür · Answer 2 · 2016-05-14T21:05:41.853

Basically, a block begins with start_time and ends with a line of whitespace. If the end of a block is always guaranteed to be an empty line, you can change the test below.

It helps to use arrays instead of variables with integer suffixes.

When you hit the start of a new block, record the start_time value. When you hit a stat line, update column sums, and when you hit a line of whitespace, print the column sums, and clear them.

This way, you keep your program's memory footprint proportional to the longest line of input as opposed to the largest block of input. In this case, there isn't a huge difference, but, in real life, there can be. Your original program used foreach over <INFILE>, which reads the entire file into memory as a list of lines and would really cause your program's memory footprint to balloon with large inputs.
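The difference comes down to the context in which the readline operator is evaluated. The sketch below uses an in-memory filehandle and made-up data to show both forms side by side; it illustrates the mechanism and does not measure memory.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up input: five data rows.
my $data = join '', map { "0x00a,$_,$_\n" } 1 .. 5;

# List context: <$fh> returns every remaining line at once, so the whole
# input is held in memory. foreach (<$fh>) evaluates <$fh> the same way.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my @all_lines = <$fh>;
close $fh;

# Scalar context: each iteration reads exactly one line, so only the
# current line and the running total are kept.
open $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $total = 0;
while ( my $line = <$fh> ) {
    my ( undef, $first ) = split /,/, $line;
    $total += $first;
}
close $fh;

print scalar(@all_lines), " lines slurped; streamed total is $total\n";
```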

#!/usr/bin/env perl

use strict;
use warnings;

my $start_time;
my @cols;

while (my $line = <DATA>) {
    if ( $line =~ /^start_time \s+ ([0-9]+)/x) {
        $start_time = $1;
    }
    elsif ( $line =~ /^0x/ ) {
        my ($id, @vals) = split /,/, $line;
        for my $i (0 .. $#vals) {
            $cols[ $i ] += $vals[ $i ];
        }
    }
    elsif ( !($line =~ /\S/) ) {
        # guard against the possibility of
        # multiple blank/whitespace lines between records
        if ( @cols ) {
            print join("\t", $start_time, @cols), "\n";
            @cols = ();
        }
    }
}

# in case there is no blank/whitespace line after last record
if ( @cols ) {
    print join("\t", $start_time, @cols), "\n";
}

__DATA__
volume stats
start_time  1
length      2
--------
ID
0x00a,1,2,3,4
0x00b,11,12,13,14
0x00c,21,22,23,24

volume stats
start_time  2
length      2
--------
ID
0x00a,31,32,33,34
0x00b,41,42,43,44
0x00c,51,52,53,54

volume stats
start_time  3
length      2
--------
ID
0x00a,61,62,63,64
0x00b,71,72,73,74
0x00c,81,82,83,84

Output:

1       33      36      39      42
2       123     126     129     132
3       213     216     219     222

Is there any advantage in general writing `my ($id, @vals) = split ...` instead of using something like `@vals = splice [split...], 1` to remove a useless first item, or is it just more idiomatic? Other thing, take care that the file may not have a newline at the end. — Casimir et Hippolyte, May 14 '16 at 12:40
@CasimiretHippolyte `$id` might be useful ... if not, use `(undef, @vals)` instead of `splice` to avoid allocating memory for the first element. — Sinan Ünür, May 14 '16 at 13:17
`undef` obviously! I was looking for something like this but I didn't find a way to write it. Thanks. — Casimir et Hippolyte, May 14 '16 at 13:18
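A short sketch of the `(undef, @vals)` form discussed in the comments above, using one row from the sample data. The chomp is an addition for this example so that the last value does not carry the trailing newline.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $line = "0x00a,1,2,3,4\n";
chomp $line;

# undef in the assignment list discards the ID field, so no variable is
# created for a value that is never used.
my ( undef, @vals ) = split /,/, $line;

print "@vals\n";
```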

score 0 · Answer 3 · answered May 13 '16 at 16:27

When I run your code, I get warnings:

Use of uninitialized value $date_time in concatenation (.) or string

I fixed it by using \s+ instead of / /.
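The sketch below shows why the separator matters, assuming the two-space gap that appears in the sample input. The array names are made up for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $line = "start_time  1\n";    # two spaces, as in the sample input

# Three literal spaces never match here, so the line comes back as a
# single field and index 1 is undefined.
my @by_three_spaces = split /   /, $line;

# A single literal space splits between the two spaces, leaving an
# empty string at index 1.
my @by_one_space = split / /, $line;

# \s+ treats the whole run of whitespace as one separator; the trailing
# newline yields an empty trailing field, which split drops.
my @by_whitespace = split /\s+/, $line;

print 'three spaces: ', ( defined( $by_three_spaces[1] ) ? 'defined' : 'undef' ), "\n";
print "one space:    '$by_one_space[1]'\n";
print "\\s+:          '$by_whitespace[1]'\n";
```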

I also added a print after your loop in case the file does not end with a blank line.

Here is minimally-changed code to produce your desired output:

use strict;
use warnings;

# Define file names and its location
my $input = $ARGV[0];

# Grab the vols stats for different intervals
open (INFILE,"$input") or die "Could not open sample.txt: $!";
my $date_time;
my $length;
my $col_1;
my $col_2;
my $col_3;
my $col_4;
foreach my $line (<INFILE>)
{
    if ($line =~ m/start/)
        {
            my @date_fields = split(/\s+/,$line);
            $date_time = $date_fields[1];
        }
    if ($line =~ m/length/i)
        {
            my @length_fields = split(/\s+/,$line);
            $length = $length_fields[1];
        }
    if ($line =~ m/0[xX][0-9a-fA-F]+/)
        {
            my @volume_fields = split(/,/,$line);
            $col_1 += $volume_fields[1];
            $col_2 += $volume_fields[2];
            $col_3 += $volume_fields[3];
            $col_4 += $volume_fields[4];
        }
    if ($line =~ /^$/)
        {
            print "$date_time $col_1 $col_2 $col_3 $col_4\n";
            $col_1=0;$col_2=0;$col_3=0;$col_4=0;
        }
}
print "$date_time $col_1 $col_2 $col_3 $col_4\n";
close (INFILE);


__END__

1 33 36 39 42
2 123 126 129 132
3 213 216 219 222
