I am mostly a Matlab user and a Perl n00b. This is my first Perl script.
I have a large fixed-width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data for all parameters is interleaved and sorted by time, which makes it difficult (at least for me) to parse in Matlab. Since Matlab is not that good at parsing text, I thought I would try Perl. I wrote the following code, which works ... at least on my small test file. However, it is painfully slow when I try it on an actual large data file. It was pieced together with lots of examples for various tasks from the web / Perl documentation.
Here is a small sample of the data file. Note: the real file has about 2000 parameters and is 1-2 GB. Parameters can be text, doubles, or unsigned integers.
Param 1 filter = ALL_VALUES
Param 2 filter = ALL_VALUES
Param 3 filter = ALL_VALUES
Time       Name                   Ty  Value
---------- ---------------------- --- ------------
1.1        Param 1                UI  5
2.23       Param 3                TXT Some Text 1
3.2        Param 1                UI  10
4.5        Param 2                D   2.1234
5.3        Param 1                UI  15
6.121      Param 2                D   3.1234
7.56       Param 3                TXT Some Text 2
The basic logic of my script is to:
- Read until the ---- line to build the list of parameters to extract (each of those lines always has "filter =").
- Use the ---- line to determine the field widths. Its fields are separated by spaces.
- For each parameter, build a time array and a data array (a while loop nested inside the foreach over parameters).
- In the continue block, write the time and data to the binary file. Then record the name, type, and offsets in a text table-of-contents file (used later to read the file into Matlab).
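To make the second step concrete, here is a minimal sketch of how the dashes line turns into an unpack template. The dashes line is copied from the sample; the padded data line is made up for illustration, and I am assuming the fields are left-aligned in their columns.

```perl
use strict;
use warnings;

# Separator line copied from the sample (dash runs of 10, 22, 3 and 12 characters).
my $dashes = '---------- ---------------------- --- ------------';

# Each run of dashes plus its trailing spaces becomes one 'A<width>' field.
my @fields = map { my $w = length $_; "A$w" } $dashes =~ /(-+\s*)/g;
$fields[-1] = 'A*';          # the last column takes the rest of the line
my $template = "@fields";    # whitespace inside a template is ignored by unpack

# Hypothetical data line, padded to the widths above (left-aligned is an assumption).
my $line = sprintf '%-10s %-22s %-3s %s', '4.5', 'Param 2', 'D', '2.1234';

# 'A' fields strip trailing spaces, so the padding disappears here.
my ($time, $name, $type, $value) = unpack $template, $line;

print "$template\n";
print "$time|$name|$type|$value\n";
```

The 'A' format only strips trailing spaces, so this would need adjusting if the real columns turn out to be right-aligned.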
Here is my script:
#!/usr/bin/perl
use strict;
use warnings;

my $lineArg1 = $ARGV[0];
open(INFILE, '<', $lineArg1) or die "Cannot open $lineArg1: $!";
open(BINOUT, '>:raw', $lineArg1.".bin") or die "Cannot open $lineArg1.bin: $!";
open(TOCOUT, '>', $lineArg1.".toc") or die "Cannot open $lineArg1.toc: $!";
my $line;
my $data_start_pos;
my @param_name;
my @template;
# Read the header: collect parameter names, then build the unpack template
while ($line = <INFILE>) {
    chomp $line;
    if ($line =~ s/\s+filter = ALL_VALUES//) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @param_name, $line;
    }
    elsif ($line =~ /^------/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g;
        $template[-1] = 'A*';
        $data_start_pos = tell INFILE;
        last; #Reached start of data exit loop
    }
}
my $template = "@template";
my @lineData;
my @param_data;
my @param_time;
my $data_type;
foreach my $current_param (@param_name) {
    @param_time = ();
    @param_data = ();
    seek(INFILE,$data_start_pos,0); #Jump to data start
    while ($line = <INFILE>) {
        if ($line =~ /\Q$current_param\E/) { #Match the name literally
            chomp($line);
            @lineData = unpack $template, $line;
            push @param_time, $lineData[0];
            push @param_data, $lineData[3];
        }
    } # END WHILE <INFILE>
} #END FOR EACH NAME
continue {
    $data_type = $lineData[2];
    print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start time
    print BINOUT pack('d*', @param_time); #Write TimeStamps
    print TOCOUT tell(BINOUT).","; #offset to end of time/data start
    if ($data_type eq "TXT") {
        print BINOUT pack 'A*', join("\n",@param_data);
    }
    elsif ($data_type eq "D") {
        print BINOUT pack('d*', @param_data);
    }
    elsif ($data_type eq "UI") {
        print BINOUT pack('L*', @param_data);
    }
    print TOCOUT tell(BINOUT).","."\n"; #Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);
So my questions to you good people of the web are as follows:
- What am I obviously screwing up? Syntax, declaring variables when I don't need to, etc.
- This is probably slow (I'm guessing) because of the nested loops, which search the whole file line by line once for every parameter. Is there a better way to restructure the loops to extract multiple parameters in one pass?
- Any other speed improvement tips you can give?
Edit: I modified the example text file to illustrate that time stamps may be non-integer and that parameter names may contain spaces.
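For the restructuring question, the direction I have in mind is a single pass that groups rows by parameter name in a hash, instead of rescanning the file once per parameter. Below is a rough sketch, not a tested solution for the real file: collect_params is a hypothetical helper, the input is an in-memory stand-in built with sprintf, and it assumes left-aligned columns and that everything fits in memory (which may not hold for a 1-2 GB file).

```perl
use strict;
use warnings;

# Hypothetical helper: read every data line once and group it by parameter name.
# Returns a hash ref: name => { type => ..., time => [...], data => [...] }.
sub collect_params {
    my ($fh, $template) = @_;
    my %params;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip empty lines
        my ($time, $name, $type, $value) = unpack $template, $line;
        # Create the record the first time a name is seen (needs Perl 5.10+ for //=).
        my $p = $params{$name} //= { type => $type, time => [], data => [] };
        push @{ $p->{time} }, $time;
        push @{ $p->{data} }, $value;
    }
    return \%params;
}

# Demo input: an in-memory "file" with rows padded to the sample's column widths.
my $text = join '', map { sprintf "%-10s %-22s %-3s %s\n", @$_ } (
    [ '1.1',  'Param 1', 'UI',  5 ],
    [ '2.23', 'Param 3', 'TXT', 'Some Text 1' ],
    [ '3.2',  'Param 1', 'UI',  10 ],
    [ '4.5',  'Param 2', 'D',   2.1234 ],
);
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $params = collect_params($fh, 'A11 A23 A4 A*');
close $fh;

# Print what was collected for each parameter.
for my $name (sort keys %$params) {
    my $p = $params->{$name};
    printf "%s (%s): times=[%s] values=[%s]\n",
        $name, $p->{type},
        join(',', @{ $p->{time} }), join(',', @{ $p->{data} });
}
```

Because the name comes from the unpacked field and is used as an exact hash key, "Param 1" would not accidentally pick up rows for something like "Param 10", which a regex match on the whole line could do.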