I need to split the data in one file into separate files, one per job ID. The data file and its two helper files are structured as follows:
# data.txt - data file - node,timestamp,data
1,1516,25
2,1845,24
3,1637,26
4,1342,74
5,1426,63
6,1436,23
7,1732,64
1,1836,83
2,1277,12
3,2435,62
4,2433,47
5,2496,52
6,2142,69
7,2176,53
# job.txt - job timing - job,startts,endts
1234,1001,2000
5678,2001,2500
# node.txt - node to job map - job,node
1234,1
1234,2
1234,3
1234,4
1234,5
5678,3
5678,4
5678,5
5678,6
5678,7
To map a line of the data file to its output file, two lookups are needed. First, the line's timestamp determines which jobs were running at that time. Second, those running jobs are checked to find the one that owns the node the line refers to. This is my solution:
use strict;
use warnings;
my @timing = ( );
my %nodes = ( );
my %handles = ( );
### array of arrays containing job, start time, and end time
open my $job_fh, '<', 'job.txt' or die "can't open job.txt: $!";
while (<$job_fh>) {
    chomp;    # drop the newline so the end time is a clean number
    # store [job, start, end] as an anonymous array
    push @timing, [ (split /,/)[0 .. 2] ];
}
close $job_fh;
### map job -> array of nodes
open my $nid_fh, '<', 'node.txt' or die "can't open node.txt: $!";
while (<$nid_fh>) {
    chomp;    # otherwise the node id keeps its trailing newline
    my ($job, $node) = split /,/;
    # autovivification creates the array on first push
    push @{ $nodes{$job} }, $node;
}
close $nid_fh;
### split data
open my $data_fh, '<', 'data.txt' or die "can't open data.txt: $!";
while (<$data_fh>) {
    chomp;
    my ($node, $ts, $value) = split /,/;
    # jobs whose time window contains this timestamp
    my @jobs = grep { $ts >= $_->[1] && $ts <= $_->[2] } @timing;
    # of those, keep the first job that actually ran on this node
    my ($owner) = grep {
        my $job = $_->[0];
        grep { $_ == $node } @{ $nodes{$job} || [] }
    } @jobs;
    next unless $owner;
    my $jid = $owner->[0];
    ### create and memoize file handles
    if (!exists $handles{$jid}) {
        open my $temp, '>', "$jid.txt" or die "Can't open $jid.txt: $!";
        $handles{$jid} = $temp;
    }
    print { $handles{$jid} } "$ts,$value\n";
}
close $data_fh;
close $_ for values %handles;
I would like to know whether there are ways to make the data file loop faster or more efficient. It has to run over large amounts of data, so it needs to be as efficient as possible. I would also appreciate comments on more idiomatic approaches: this is my first Perl script (the array of references to arrays took quite a while to figure out).
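To make the question concrete, here is a minimal sketch of one direction I have been considering: inverting the map so it is keyed by node, which lets each data line do a single hash lookup and then scan only the jobs that ever used that node. The names build_index, find_job, and %by_node are hypothetical, and the sample values are typed inline from job.txt and node.txt rather than read from disk. I have not measured whether this is actually faster on real data.

```perl
use strict;
use warnings;

# Build node -> list of [job, start, end] from the two helper tables.
sub build_index {
    my ($timing, $node_map) = @_;
    my %window;    # job -> [start, end]
    $window{ $_->[0] } = [ $_->[1], $_->[2] ] for @$timing;
    my %by_node;
    for my $pair (@$node_map) {
        my ($job, $node) = @$pair;
        next unless $window{$job};    # ignore nodes mapped to unknown jobs
        push @{ $by_node{$node} }, [ $job, @{ $window{$job} } ];
    }
    return \%by_node;
}

# Return the job id that owned $node at time $ts, or undef if none did.
sub find_job {
    my ($by_node, $node, $ts) = @_;
    for my $entry (@{ $by_node->{$node} || [] }) {
        my ($job, $start, $end) = @$entry;
        return $job if $ts >= $start && $ts <= $end;
    }
    return;
}

# Sample values copied from job.txt and node.txt above.
my @timing   = ([1234, 1001, 2000], [5678, 2001, 2500]);
my @node_map = ((map { [1234, $_] } 1 .. 5), (map { [5678, $_] } 3 .. 7));
my $by_node  = build_index(\@timing, \@node_map);

for my $line ("1,1516,25", "3,2435,62", "6,1436,23") {
    my ($node, $ts, $value) = split /,/, $line;
    my $job = find_job($by_node, $node, $ts);
    print defined $job ? "$line -> $job.txt\n" : "$line -> no matching job\n";
}
```

With the sample data, the first line maps to 1234.txt, the second to 5678.txt, and the third has no matching job, because node 6 only belongs to job 5678 and timestamp 1436 falls outside that job's window.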