calculate average of small parts of column, grouped by key in perl?

Question

This question is quite similar to this one How can I get the average and standard deviations grouped by key? but I don't manage to modify it to fit my problem.

I have a lot of files (.csv) with 7 columns, the last three columns look like this:

col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234
1409,2,345
1409,3,456
1409,4,567
1409,5,678
1409,6,789
...
N,0,123
N,1,234
N,2,345
N,3,456
N,4,567
N,5,678
N,6,789

What I want to do is to calculate the average of the last column (col7) for all the values that have the same value in column 5 (col5), so 1408, 1409, 1410, ... until N and I don't know N. I want to print this average value next to the line (in col8) which contains a 3 in column 6 (col6). Do note that the value in column 6 (col6) goes from 0 to 6, but the first number of the file is not always 0. So what I want is:

col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456
...
bla,bla,bla,bla,N,3,456,456

I have some script I can use to calculate the average, but I have to be able to put my values into an array for that. Below is what I tried to do, but it doesn't work. Also, I'm just trying to learn Perl on my own, so if it looks like crap, I'm just trying!

    open (FILE, "<", $dir.$file) or die;
    my @lines = <FILE>;
    foreach my $line(@lines) {
        my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
        push @arrays5, $col5;
    }

    foreach my $array5(@arrays5) {            
        foreach my $line(@lines) {
            my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split(/\,/, $line);
            if ($array5 == $col5) {
                push @arrays7, $col7;
            }
        }
    }
close(FILE);

the $tmp_line was a mistake, corrected it now. I don't need columns 1 - 4 to calculate the average, but I also want to print them in the end, I will correct it in my question! — Nuttieke, Apr 26 '12 at 12:17
Fair enough. Is it possible that two, conflicting lines of data exist such as "1408,3,345" and "1408,3,999"? If so, what do you wish done in this case? — thb, Apr 26 '12 at 12:20
All right. @Tuxuday's answer below is interesting, and I'll not comment further on it, but you can give it a try. In the meantime, let me work on this a little and see what I can come up with. — thb, Apr 26 '12 at 12:30

Birei · Answer 1 · 2012-04-27T12:24:12.943

One way using Text::CSV_XS module. It's not a built-in one, so it has to be installed from CPAN or similar tool.

Content of script.pl:

use warnings;
use strict;
use Text::CSV_XS;

my ($offset, $col_total, $row3, $rows_processed);

## Check arguments to the script.
die qq[Usage: perl $0 <input-file>\n] unless @ARGV == 1;

## Open input file.
open my $fh, q[<], shift or die qq[Open error: $!\n];

## Create the CSV object.
my $csv = Text::CSV_XS->new or  
        die qq[ERROR: ] . Text::CSV_XS->error_diag();

## Read file content seven lines each time.
while ( my $rows = $csv->getline_all( $fh, $offset, 7 ) ) { 

        ## End when there is no more rows.
        last unless @$rows;

        ## For each row in the group of seven...
        for my $row ( 0 .. $#{$rows} ) { 

                ## Get value of last column.
                my $last_col_value = $rows->[ $row ][ $#{$rows->[$row]} ];

                ## If last column is not a number it is the header, so print it
                ## appending the eigth column and read next one.
                unless ( $last_col_value =~ m/\A\d+\Z/ ) { 
                        $csv->print( \*STDOUT, $rows->[ $row ] );
                        printf qq[,%s\n], q[col8];
                        next;
                }   

                ## Acumulate total amount for last column.
                $col_total += $last_col_value;

                ## Get third row. The output will be this row with the
                ## average appended.
                if ( $rows->[ $row ][-2] == 3 ) { 
                        $row3 = [ @{ $rows->[ $row ] } ];
                }   

                ## Count processed rows.
                ++$rows_processed;
        }   

        ## Print row with its average.
        if ( $rows_processed > 0  && ref $row3 ) { 
                $csv->print( \*STDOUT, $row3 );
                printf qq[,%g\n], $col_total / $rows_processed;
        }   

        ## Initialize variables.
        $col_total = $rows_processed = 0;
        undef $row3;
}

Content of infile:

col1,col2,col3,col4,col5,col6,col7
bla,bla,bla,bla,1408,1,123
bla,bla,bla,bla,1408,2,234
bla,bla,bla,bla,1408,3,345
bla,bla,bla,bla,1408,4,456
bla,bla,bla,bla,1408,5,567
bla,bla,bla,bla,1408,6,678
bla,bla,bla,bla,1409,0,123
bla,bla,bla,bla,1409,1,234
bla,bla,bla,bla,1409,2,345
bla,bla,bla,bla,1409,3,456
bla,bla,bla,bla,1409,4,567
bla,bla,bla,bla,1409,5,678
bla,bla,bla,bla,1409,6,789

Run it like:

perl script.pl infile

With following output:

col1,col2,col3,col4,col5,col6,col7,col8
bla,bla,bla,bla,1408,3,345,400.5
bla,bla,bla,bla,1409,3,456,456

I will try this one too, the Text::CSV_XS is not installed though. — Nuttieke, Apr 27 '12 at 07:40
I don't get it comepletely, get an error: Expected fields to be an array ref at script.pl line 57, <$fh> line 7. — Nuttieke, Apr 27 '12 at 09:28
@Nuttieke: I've edited the script to add a special check over `$row3` variable. I'm assuming because I can not reproduce your error. — Birei, Apr 27 '12 at 12:27

tuxuday · Answer 2 · 2012-04-27T10:59:58.300

0

This should do the trick. Replace Cols[index] appropriately.

    use Data::Dumper ;
    open (FILE, "<", '/tmp/myfile') or die;
    my @lines ;
    my (%Sum,%Count);

    chomp(@lines = <FILE>);
    foreach my $line(@lines) {
        next if $line =~ /col/;
        my @Cols = split /,/, $line;
        $Sum{$Cols[0]} +=  $Cols[2] ;
        $Count{$Cols[0]}++;
    }

    foreach my $line(@lines) {
        if($line=~/col/) {
            print "$line,colX\n" ;
            next;
        }

        my @Cols = split /,/, $line;
        if($Cols[1]==3) {
            print "$line,",$Sum{$Cols[0]}/$Count{$Cols[0]},"\n" ;
        } else {
            print "$line,-1\n";
        }
    }

Sample input /tmp/myfile

col5,col6,col7
1408,1,123
1408,2,234
1408,3,345
1408,4,456
1408,5,567
1408,6,678
1409,0,123
1409,1,234

Sample output

col5,col6,col7,colX
1408,1,123,-1
1408,2,234,-1
1408,3,345,400.5
1408,4,456,-1
1408,5,567,-1
1408,6,678,-1
1409,0,123,-1
1409,1,234,-1

edited Apr 27 '12 at 10:59

answered Apr 26 '12 at 12:20

tuxuday

2,977
17
18

Hmm, I tried it but I get some errors: Use of uninitialized value in addition (+) at test.pl line 10, line 1886. It's got me puzzled.. – Nuttieke Apr 26 '12 at 12:49
Yes its a **warning** shouldn't be an **error** per se. Its coz line 10, **$Result{$Cols[0]} += $Cols[2] ;**, isn't initializing before addition. Did you copied the code as it is and ran - modifying only **index of @Cols**? – tuxuday Apr 26 '12 at 13:00
Did you managed to bring avg and meet other requirements with the help of above code? – tuxuday Apr 26 '12 at 14:00
Not sure how to print the results as a csv-file again with Dumper. – Nuttieke Apr 27 '12 at 09:41
It works quite good! But, the average is now printed on a new line. I have tried chomping $line, but does not make a difference. – Nuttieke Apr 27 '12 at 10:53
What you mean by newline? Post sample input/output. – tuxuday Apr 27 '12 at 10:57
I have added sample input/output to my post. – tuxuday Apr 27 '12 at 11:00
In your sample output I get the -1 and 400.5 in a new line. As if there would be a \n between $line and ,$Sum{$Cols[0]}/$Count{$Cols[0]}, – Nuttieke Apr 27 '12 at 11:05
Assign Avg calculation to a var, then do **print "$line,$Avg\n";**. Which OS? This is for your work or homework? – tuxuday Apr 27 '12 at 11:16
I already tried that, also limited the decimal numbers to 3. Unix – Nuttieke Apr 27 '12 at 11:20
Impossible that you are getting newlines there. You are running with my input sample or yours? – tuxuday Apr 27 '12 at 11:24
I fixed it. I tried it on my data, but apparently because the csv was created on windows I had to use regex s/\s+$//. Thanks again for your help! – Nuttieke Apr 27 '12 at 11:38

thb · Accepted Answer · 2012-04-26T13:32:25.913

0

Before we try to complete the answer, would you try this and tell me how close it comes to what you want?

#!/usr/bin/perl
use warnings;
use strict;

my $target = 3;

my %summary;

while(<>) {
    chomp;
    my ($col1,$col2,$col3,$col4,$col5,$col6,$col7) = split /\,/;
    $summary{$col5}{total} += $col7;
    ++$summary{$col5}{count};
    $summary{$col5}{line} = $_ if $col6 == $target;
}

$summary{$_}{average} = $summary{$_}{total} / $summary{$_}{count}
    for keys %summary;

print "${summary{$_}{line}},${summary{$_}{average}}\n"
    for sort keys %summary;

If close enough, then you may wish to finish on your own. If not, then we can discuss the matter further.

Note that you can replace the <> with <FILE> if you prefer to read from your data file rather than from standard input.

IMPLEMENTATION NOTES

The code relies on Perl's autovivification feature. Observe for instance the line ++$summary{$col5}{count};, which seems initially to increment a nonexistent counter. However, this is actually standard Perl idiom. If you try to do something arithmetical (like incrementation) to an object that does not exist, Perl implicitly creates the object, initializes it to zero, and then does the thing you wanted (like incrementation) to it.

It would probably be unwise for a more sober programming language like C++ to autovivify, but years of experience suggest that autovivification strikes the right balance between order and convenience in a slightly less sober language like Perl.

On a more elementary level, the code will probably make sense only to those used to Perl's hashes. However, if you've not used Perl's hashes before, this would be as good a chance as any to learn them. The hash is a central pillar of the language, and the above makes a fairly typical example of its use.

In this case, we have a hash of hashes, which again is fairly typical.

edited Apr 26 '12 at 13:32

answered Apr 26 '12 at 12:46

thb

13,796
3
40
68

Thank you. I have tried it, but I get an empty file, but I think this goes into the right direction. I thought I needed to use hashes, but I really did not know how to.. – Nuttieke Apr 26 '12 at 13:04
Good. Regarding the empty file, my code reads from standard input, not from your file. To get it to read from your file, you'll want to replace the `<>` with `` (and of course you'll have to have opened the `FILE` first, as your original code already does). – thb Apr 26 '12 at 13:10
Yes, I did change it. I will try again! – Nuttieke Apr 26 '12 at 13:25
It works! It prints the average first and then the rest. Is that because of the sort-thingie? – Nuttieke Apr 26 '12 at 13:38
Ow, it works but not for all lines. In some lines he gives "averagevalue,col5,col6,col7" other lines show: "averagevalue,col3,col4,col5,col6,col7" and all the lines seem to start with a comma – Nuttieke Apr 26 '12 at 13:45
You are correct. My code is a simplified example, meant to push you in the right direction. It does the main part of the job, but it does not handle the title line or (as far as I know, for I've not tried it) the "N" lines. My code loses teaching value if I clutter it with special-case handling! Regarding the sort, that's nothing. It's just to output in sorted order of column 5. If you prefer a different order, please advise, for we can do whatever order you want with a little additional coding. – thb Apr 26 '12 at 13:55
Thank you very much, I wanted to try with hashes, you made it more clear to me! – Nuttieke Apr 27 '12 at 10:54

calculate average of small parts of column, grouped by key in perl?

3 Answers3