
I am trying to parse a flat file and aggregate some columns, grouping by certain other columns that serve as keys. I do this by building a hash-of-arrays data structure. Once the HoA is built, I iterate through the hash and write its contents to a new file. The code works fine on small inputs, but on large ones (~800 MB) the script dies with an out-of-memory error. Below is a code snippet of my script. In reality I am parsing data with 140 columns, so each hash key holds an array of 100+ elements.

I did some research and found posts recommending that this data structure be stored on disk using modules like DB_File or DBM::Deep, but I found their usage hard to understand and hard to fit into my code. Can someone please suggest the best way to handle this?

use strict;
use warnings;
use Data::Dumper;
my $header = <DATA>;
chomp $header;
my @ColHeader = split /\|/, $header;

# Map each column name to its zero-based position in a row.
my $j = 0;
my %ColPos = map { $_ => $j++ } @ColHeader;
print Dumper \%ColPos;

my %hash;
my @KeyCols  = qw(col1 col2 col3);
my @AggrCols = qw(col4 col5 col6 col7 col9);

while (my $line = <DATA>) {
    chomp $line;
    my @rowData = split /\|/, $line;

    # Composite key from the key columns; accumulate a running sum
    # for each aggregate column under that key.
    my $Key = join ':', @rowData[ @ColPos{@KeyCols} ];
    my $i = 0;
    foreach my $k (@rowData[ @ColPos{@AggrCols} ]) {
        $hash{$Key}[ $i++ ] += $k;
    }
}
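
# Write-out pass mentioned above: walk the finished hash and write each
# key's sums to a new file (a sketch: the filename and '|' layout are assumed).
open my $out, '>', 'aggregated.txt' or die "Cannot open aggregated.txt: $!";
for my $Key (sort keys %hash) {
    print {$out} join('|', split(/:/, $Key), @{ $hash{$Key} }), "\n";
}
close $out;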


__DATA__
col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
a1|a2|a3|1|2|3|4|somedata|1|text|alpha
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
b1|b2|b3|1|2|3|4|somedata|1|text|alpha
a1|a2|a3|1|2|3|4|somedata|1|text|alpha
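
Referring back to the DB_File suggestion above, here is a minimal sketch of what that route might look like for this script. DB_File values must be flat strings, so the running sums for each key are packed into a single '|'-joined string instead of an array reference; the database filename aggregate.db is an arbitrary choice.

#!/usr/bin/perl
# Sketch only: same aggregation as above, but with %hash tied to an on-disk
# Berkeley DB file via DB_File so it does not have to fit in RAM.
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

my @KeyCols  = qw(col1 col2 col3);
my @AggrCols = qw(col4 col5 col6 col7 col9);

tie my %hash, 'DB_File', 'aggregate.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Cannot tie aggregate.db: $!";

my $header = <DATA>;
chomp $header;
my $j = 0;
my %ColPos = map { $_ => $j++ } split /\|/, $header;

while (my $line = <DATA>) {
    chomp $line;
    my @rowData = split /\|/, $line;
    my $Key = join ':', @rowData[ @ColPos{@KeyCols} ];

    # Unpack the stored sums (zeroes for a new key), add this row, repack.
    my @sums = defined $hash{$Key}
             ? split /\|/, $hash{$Key}
             : (0) x @AggrCols;
    my $i = 0;
    $sums[ $i++ ] += $_ for @rowData[ @ColPos{@AggrCols} ];
    $hash{$Key} = join '|', @sums;
}

# The tied hash can be walked the same way to write the results out.
while (my ($Key, $packed) = each %hash) {
    print join('|', split(/:/, $Key), split(/\|/, $packed)), "\n";
}

untie %hash;

__DATA__
col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
a1|a2|a3|1|2|3|4|somedata|1|text|alpha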
chidori
  • How many hash keys are you dealing with in your actual data set? – Zaid Oct 05 '16 at 17:08
  • My script breaks with out of memory while building data but the records would be somewhere around ~64 million – chidori Oct 05 '16 at 17:09
  • So the sample data you show in the `__DATA__` section in reality has ~64 million lines and ~140 columns, right? – PerlDuck Oct 05 '16 at 17:12
  • @PerlDuck yes, that is right. – chidori Oct 05 '16 at 17:13
  • Is your input really separated with `|` characters and the key columns are the first three of them? Then please run this code: `cut -d '|' -f1-3 data.txt | sort -u | wc -l` (may take a few secs) and tell the result. It counts the different combinations of col1|col2|col3 and thus the number of elements in your hash. – PerlDuck Oct 05 '16 at 17:52
  • Perhaps it's a 32 vs. 64 bit Perl version issue. Please also give the output of [`perl -V:archname`](http://stackoverflow.com/a/26172826/5830574). – PerlDuck Oct 05 '16 at 19:36
  • @PerlDuck It was indeed a problem with bit. switching to 64 bit solved the issue. – chidori Oct 26 '16 at 18:13
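
Since the fix turned out to be moving to a 64-bit build, here is a quick way to check a build from within Perl itself, using the core Config module (a small sketch):

use strict;
use warnings;
use Config;

# archname names the build, e.g. x86_64-linux (64-bit) vs. i686-linux (32-bit)
print "archname: $Config{archname}\n";

# ptrsize is 8 on a 64-bit perl and 4 on a 32-bit one
print "ptrsize:  $Config{ptrsize}\n";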

1 Answer


This might be a solution, though I'm not very familiar with the DBD::CSV module. This approach avoids loading your data into a hash, so there should be no out-of-memory problems. You would need DBD::CSV and DBI installed.

(Note that I entered one more line of data, `a1|a2|a4|1|2|3|4|somedata|1|text|alpha`, when testing.)

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Treat pipe-separated j1.txt in the current directory as a table named 'data'.
my $dbh = DBI->connect("DBI:CSV:f_dir=.");
$dbh->{'csv_tables'}->{'data'} = { 'file' => 'j1.txt', 'csv_sep_char' => "|"};

my @KeyCols = qw(col1 col2 col3);
my @AggrCols = qw(col4 col5 col6 col7 col9);
my @sums = map "sum($_)", @AggrCols;

my $sql;
{
    # $" is the list separator; set it to ',' so the arrays
    # interpolate comma-separated inside the SQL string below.
    local $" = ',';
    $sql = <<SQL;
    select @KeyCols, @sums
    from data
    group by @KeyCols
SQL
}

my $sth = $dbh->prepare( $sql );
$sth->execute;

while ( my $row = $sth->fetchrow_arrayref ) {
    print "@$row\n";
}

__END__
*** contents of j1.txt

col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
a1|a2|a3|1|2|3|4|somedata|1|text|alpha
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
b1|b2|b3|1|2|3|4|somedata|1|text|alpha
a1|a2|a3|1|2|3|4|somedata|1|text|alpha
a1|a2|a4|1|2|3|4|somedata|1|text|alpha

Output from this was:

c1 c2 c3 3 6 9 12 3
b1 b2 b3 1 2 3 4 1
a1 a2 a3 2 4 6 8 2
a1 a2 a4 1 2 3 4 1

How you want your output displayed is up to you. In this example the fields are space separated, which is probably not what you would want.
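
To get pipe-delimited output instead, the fetch loop could join with '|' (a sketch, writing to STDOUT here):

while ( my $row = $sth->fetchrow_arrayref ) {
    print join( '|', @$row ), "\n";
}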

Update: I may be wrong about the memory use, especially with a file this large; it could still run out of memory. Sorry if this doesn't help.

Chris Charley