I am trying to parse a flat file and aggregate some columns, grouped by the columns that act as keys. I do this by building a hash-of-arrays data structure; once it is built, I iterate over the hash again and write the contents to a new file. The code works fine on small input, but on large input (~800 MB) the script dies with an out-of-memory error. Below is a snippet of my script. In reality the data I parse has 140 columns, so each hash key ends up holding an array with 100+ elements.
I did some research and found posts recommending that the data structure be stored on disk with a module such as DB_File or DBM::Deep, but I found their usage hard to follow and could not work out how to fit them into my code (a rough DBM::Deep sketch of what I have in mind follows the snippet). Can someone please suggest the best way to handle this?
use strict;
use warnings;
use Data::Dumper;

# Map each column name from the header line to its position
my $header = <DATA>;
chomp $header;
my @ColHeader = split /\|/, $header;
my $j = 0;
my %ColPos = map { $_ => $j++ } @ColHeader;
print Dumper \%ColPos;

my %hash;
my @KeyCols  = qw(col1 col2 col3);
my @AggrCols = qw(col4 col5 col6 col7 col9);

# Build the hash of arrays: the key is the key columns joined with ':',
# the value is an array of running sums for the aggregated columns
while ( my $line = <DATA> ) {
    chomp $line;
    my @rowData = split /\|/, $line;
    my $Key = join ':', @rowData[ @ColPos{@KeyCols} ];
    my $i = 0;
    foreach my $k ( @rowData[ @ColPos{@AggrCols} ] ) {
        $hash{$Key}[ $i++ ] += $k;
    }
}

# (Second pass that writes %hash out to a new file omitted here)

__DATA__
__DATA__
col1|col2|col3|col4|col5|col6|col7|col8|col9|col10|col11
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
a1|a2|a3|1|2|3|4|somedata|1|text|alpha
c1|c2|c3|1|2|3|4|somedata|1|text|alpha
b1|b2|b3|1|2|3|4|somedata|1|text|alpha
a1|a2|a3|1|2|3|4|somedata|1|text|alpha
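
For reference, here is a minimal, untested sketch of how I think the same aggregation might look with the hash tied to disk via DBM::Deep, so the whole structure no longer has to fit in RAM. The file name "aggregate.db" and reading from STDIN instead of __DATA__ are just placeholders, not part of my real script:

use strict;
use warnings;
use DBM::Deep;

# Disk-backed hash; "aggregate.db" is a placeholder file name
my $db = DBM::Deep->new( "aggregate.db" );

my $header = <STDIN>;
chomp $header;
my @ColHeader = split /\|/, $header;
my $j = 0;
my %ColPos = map { $_ => $j++ } @ColHeader;

my @KeyCols  = qw(col1 col2 col3);
my @AggrCols = qw(col4 col5 col6 col7 col9);

while ( my $line = <STDIN> ) {
    chomp $line;
    my @rowData = split /\|/, $line;
    my $Key = join ':', @rowData[ @ColPos{@KeyCols} ];

    # Read the current sums into a plain Perl array, update them, then
    # write the whole array back in one assignment, so there is one disk
    # write per row rather than one per aggregated column
    my @sums = exists $db->{$Key} ? @{ $db->{$Key} } : (0) x @AggrCols;
    my $i = 0;
    $sums[ $i++ ] += $_ for @rowData[ @ColPos{@AggrCols} ];
    $db->{$Key} = \@sums;
}

# The results could later be read back like a normal hash:
# for my $Key ( keys %$db ) {
#     print join( '|', $Key, @{ $db->{$Key} } ), "\n";
# }

Is this the right way to use DBM::Deep for my case, or is there a better approach (e.g. DB_File) given the size of the data?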