
I am mostly a Matlab user and a Perl n00b. This is my first Perl script.

I have a large fixed-width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data parameters are sorted by time, which makes them difficult (at least for me) to parse into Matlab. So, seeing how Matlab is not that good at parsing text, I thought I would try Perl. I wrote the following code, which works ... at least on my small test file. However, it is painfully slow when I tried it on an actual large data file. It was pieced together from lots of examples for various tasks from the web / Perl documentation.

Here is a small sample of the data file. Note: the real file has about 2000 parameters and is 1-2 GB. Parameters can be text, doubles, or unsigned integers.

Param 1   filter = ALL_VALUES
Param 2   filter = ALL_VALUES
Param 3   filter = ALL_VALUES

Time                     Name     Ty  Value                   
---------- ---------------------- --- ------------
1.1        Param 1                UI  5           
2.23       Param 3                TXT Some Text 1 
3.2        Param 1                UI  10          
4.5        Param 2                D   2.1234     
5.3        Param 1                UI  15         
6.121      Param 2                D   3.1234     
7.56       Param 3                TXT Some Text 2 

The basic logic of my script is to:

  1. Read until the ---- line to build the list of parameters to extract (those lines always contain "filter =").
  2. Use the --- line to determine the field widths; it is broken by spaces.
  3. For each parameter, build the time and data arrays (a while loop nested inside the foreach over parameters).
  4. In the continue block, write the time and data to the binary file, then record the name, type, and offsets in a text table-of-contents file (used later to read the binary file into Matlab).

Here is my script:

#!/usr/bin/perl

$lineArg1 = @ARGV[0];
open(INFILE, $lineArg1);
open BINOUT, '>:raw', $lineArg1.".bin";
open TOCOUT, '>', $lineArg1.".toc";

my $line;
my $data_start_pos;
my @param_name;
my @template;
while ($line = <INFILE>) {
    chomp $line;
    if ($line =~ s/\s+filter = ALL_VALUES//) {
       $line = =~ s/^\s+//;
       $line =~ s/\s+$//;
       push @param_name, $line;
    }
    elsif ($line =~ /^------/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g;
        $template[-1] = 'A*';        
        $data_start_pos = tell INFILE;
        last; #Reached start of data exit loop
    }
}
my $template = "@template";
my @lineData;
my @param_data;
my @param_time;
my $data_type;
foreach $current_param (@param_name) {
    @param_time = ();
    @param_data = ();    
    seek(INFILE,$data_start_pos,0); #Jump to data start
    while ($line = <INFILE>) {
        if($line =~ /$current_param/) {      
           chomp($line);
           @lineData = unpack $template, $line;
           push @param_time, @lineData[0];   
           push @param_data, @lineData[3];
        }       
    } # END WHILE <INFILE>
} #END FOR EACH NAME
continue {
        $data_type = @lineData[2];
        print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start time        
        print BINOUT pack('d*', @param_time);  #Write TimeStamps
        print TOCOUT tell(BINOUT).","; #offset to end of time/data start
        if ($data_type eq "TXT") {
            print BINOUT pack 'A*', join("\n",@param_data);
        }
        elsif ($data_type eq "D") {
            print BINOUT pack('d*', @param_data);
        }
        elsif ($data_type eq "UI") {
            print BINOUT pack('L*', @param_data);
        }        
        print TOCOUT tell(BINOUT).","."\n"; #Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);

So my questions to you good people of the web are as follows:

  1. What am I obviously screwing up? Syntax, declaring variables when I don't need to, etc.
  2. This is probably slow (guessing) because of the nested loops and searching the file line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
  3. Any other speed improvement tips you can give?

Edit: I modified the example text file to illustrate non-integer time stamps and param names that may contain spaces.

Aero Engy
  • Can you show what you expect in the TOC file and the BIN file for the example above? – Sinan Ünür Dec 19 '11 at 23:23
  • @SinanÜnür The TOC file would look something like this: Note the offset numbers are made up. Param1,UI,0,10,20, Param2,D,20,30,40, Param3,TXT,40,50,60, where the format is Name, type, offset to time start, offset to time end, offset to data end. So all that would be needed in Matlab is to fread the binary file from start to end offset using the appropriate data type. – Aero Engy Dec 20 '11 at 00:33
  • @SinanÜnür I will only write the first parameter out as it would sort of appear in binary. I will use hex notation although this would be a binary file. Also I am writing the timestamps as singles instead of doubles for space. 0x3f800000 0x40400000 0x40a00000 0x00000005 0x0000000A 0x0000000F. So Param1 would have timestart, timeend, and datastart offsets of 0,96,192 (if I added that up correctly) – Aero Engy Dec 20 '11 at 00:47
  • Does it have to be sorted by param name? – Brad Gilbert Dec 20 '11 at 15:28
  • @BradGilbert If it is not sorted by parameter name when written to binary it would most likely be difficult and slow to build the data/time arrays in Matlab. – Aero Engy Dec 20 '11 at 16:03

4 Answers


First, you should always have 'use strict;' and 'use warnings;' pragmas in your script.

It seems like you need a simple array (@param_name) for reference, so loading those values would be straightforward as you have it. (Again, adding the above pragmas would start showing you errors, including the $line = =~ s/^\s+//; line!)
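
For example, the intended pair of substitutions there is just:

$line =~ s/^\s+//;   # strip leading whitespace (the stray '=' is a syntax error)
$line =~ s/\s+$//;   # strip trailing whitespace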

I suggest you read this to understand how you can load your data file into a Hash of Hashes. Once you've designed the hash, you simply read and load the file data contents, and then iterate through the contents of the hash.

For example, using time as the key for the hash:

%HoH = (
    1 => {
        name   => "Param1",
        ty       => "UI",
        value       => "5",
    },
    2 => {
        name   => "Param3",
        ty       => "TXT",
        value       => "Some Text 1",
    },
    3 => {
        name   => "Param1",
        ty       => "UI",
        value       => "10",
    },
);

Make sure you close the INFILE after reading in the contents, before you start processing.

So in the end, you iterate over the hash, and reference the array (instead of the file contents) for your output writes - I would imagine it would be much faster to do this.
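
A rough sketch of that load-then-process flow, reusing the unpack $template and the INFILE handle the question already builds (field order assumed to be time, name, type, value):

my %HoH;
while ( my $line = <INFILE> ) {        # continue from where the header loop stopped
    chomp $line;
    next unless $line =~ /\S/;         # skip blank lines
    my ( $time, $name, $ty, $value ) = unpack $template, $line;
    $HoH{$time} = { name => $name, ty => $ty, value => $value };
}
close INFILE;                          # done with the file before processing

for my $time ( sort { $a <=> $b } keys %HoH ) {
    # look up $HoH{$time}{name}, {ty}, {value} and do the output writes here
}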

Let me know if you need more info.

Note: if you go this route, include Data::Dumper - a significant help for printing and understanding the data in your hash!

T.j. Randall
  • That sounds promising. I was leery of building large arrays/structs etc. because of the size of the files. When I tried this in Matlab, which isn't so hot at managing memory, I would either run out of memory or start paging endlessly. So I was trying to not read too much into memory at one time. I will read up on the Hashes and give it a shot. In the end I just need arrays of time and data (grouped by parameter) written to binary ... more or less in the format described in the Continue Block of my code sample. This is so I can plot the data vs time in Matlab – Aero Engy Dec 20 '11 at 00:01

It seems to me that embedded spaces can only occur in the last field. That makes using split ' ' feasible for this problem.

I am assuming you are not interested in the header. In addition, I am assuming you want a vector for each parameter and are not interested in timestamps.

To use data file names specified on the command line or piped through standard input, replace <DATA> with <>.

#!/usr/bin/env perl

use strict; use warnings;

my %data;

$_ = <DATA> until /^-+/; # skip header

while (my $line = <DATA>) {
    $line =~ s/\s+\z//;
    last unless $line =~ /\S/;

    my (undef, $param, undef, $value) = split ' ', $line, 4;
    push @{ $data{ $param } }, $value;
}

use Data::Dumper;
print Dumper \%data;

__DATA__
Param1   filter = ALL_VALUES
Param2   filter = ALL_VALUES
Param3   filter = ALL_VALUES

Time                     Name     Ty  Value
---------- ---------------------- --- ------------
1          Param1                 UI  5
2          Param3                 TXT Some Text 1
3          Param1                 UI  10
4          Param2                 D   2.1234
5          Param1                 UI  15
6          Param2                 D   3.1234
7          Param3                 TXT Some Text 2

Output:

$VAR1 = {
          'Param2' => [
                        '2.1234',
                        '3.1234'
                      ],
          'Param1' => [
                        '5',
                        '10',
                        '15'
                      ],
          'Param3' => [
                        'Some Text 1',
                        'Some Text 2'
                      ]
        };
Sinan Ünür
  • Some of the parameter names do have spaces. So split might cause issues. There are also several other fields/columns that I did not include in the example for simplicity (Param Description, Units, & Status) and they also frequently contain spaces. That is why I was using the --- line above the start of the data. The --- line has spaces indicating the field widths of each line. That is why I was using the unpack function. – Aero Engy Dec 20 '11 at 13:58

First off, this piece of code causes the input file to be read once for every param, which is quite inefficient.

foreach $current_param (@param_name) {
    ...
    seek(INFILE,$data_start_pos,0); #Jump to data start
    while ($line = <INFILE>) { ... }
    ...
}

Also, there is very rarely a reason to use a continue block. This is more a style/readability issue than a real problem.
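
That is, whatever the continue block does can simply go at the end of the foreach body, for example with a hypothetical write_param() helper:

foreach my $current_param (@param_name) {
    # ... build @param_time and @param_data as before ...
    write_param( $current_param, \@param_time, \@param_data );   # body of the old continue block
}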


Now on to make it more performant.

I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.

This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.

#!/usr/bin/perl

use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';
use autodie qw':default copy';
use 5.10.1;

my $input_filename = shift @ARGV;
open my $input, '<', $input_filename;

my @param_names;
my $template = ''; # stop uninitialized warning
my @field_names;
my $field_name_line;
while( <$input> ){
  chomp;
  next if /^\s*$/;
  if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
    push @param_names, $param;
  }elsif( /^[\s-]+$/ ){
    my @fields = split /(\s+)/;
    my $pos = 0;
    for my $field (@fields){
      my $length = length $field;
      if( substr($field, 0, 1) eq '-' ){
        $template .= "\@${pos}A$length ";
      }
      $pos += $length;
    }
    last;
  }else{
    $field_name_line = $_;
  }
}

@field_names = unpack $template, $field_name_line;
for( @field_names ){
  s(^\s+){};
  $_ = lc $_;
  $_ = 'type' if substr('type', 0, length $_) eq $_;
}

my %temp_files;
for my $param ( @param_names ){
  for(qw'time data'){
    my $fh = tempfile 'temp_XXXX', UNLINK => 1;
    binmode $fh, ':raw';
    $temp_files{$param}{$_} = $fh;
  }
}

my %convert = (
  TXT => sub{ pack 'A*', join "\n", @_ },
  D   => sub{ pack 'd*', @_ },
  UI  => sub{ pack 'L*', @_ },
);

sub print_time{
  my($param,$time) = @_;
  my $fh = $temp_files{$param}{time};
  print {$fh} $convert{D}->($time);
}

sub print_data{
  my($param,$format,$data) = @_;
  my $fh = $temp_files{$param}{data};
  print {$fh} $convert{$format}->($data);
}

my %data_type;
while( my $line = <$input> ){
  next if $line =~ /^\s*$/;
  my %fields;
  @fields{@field_names} = unpack $template, $line;

  print_time( @fields{(qw'name time')} );
  print_data( @fields{(qw'name type value')} );

  $data_type{$fields{name}} //= $fields{type};
}
close $input;

open my $bin, '>:raw', $input_filename.".bin";
open my $toc, '>',     $input_filename.".toc";

for my $param( @param_names ){
  my $data_fh = $temp_files{$param}{data};
  my $time_fh = $temp_files{$param}{time};

  seek $data_fh, 0, 0;
  seek $time_fh, 0, 0;

  my @toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );

  copy( $time_fh, $bin, 8*1024 );
  close $time_fh;
  push @toc_line, sysseek($bin, 0, 1);

  copy( $data_fh, $bin, 8*1024 );
  close $data_fh;
  push @toc_line, sysseek($bin, 0, 1);

  say {$toc} join ',', @toc_line, '';
}

close $bin;
close $toc;
Brad Gilbert
  • Thanks for the input! I added what I have written so far before I saw your answer. I may try to incorporate using temp files to keep memory usage low. Some of the data files in theory could get pretty gigantic. I also updated the original post for a slightly better sample data file. – Aero Engy Dec 20 '11 at 20:13
  • @AeroEngy I would like to find out how well this program works on the actual data. If your data easily fits into RAM then this example might be a little overkill. – Brad Gilbert Dec 20 '11 at 20:47

I modified my code to build a Hash as suggested. I have not incorporated the binary output yet due to time limitations. Plus I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be too difficult ... hopefully

On an actual data file (~350MB & 2.0 million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on 1 of my cores (nil on the other 3) and Perl memory usage topped out at around 325MB ... until it dumped millions of lines to the prompt. However, the print Dumper call will be replaced with a binary pack.

Please let me know if I am making any rookie mistakes.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $lineArg1 = $ARGV[0];
open(INFILE, $lineArg1);

my $line;
my @param_names;
my @template;
while ($line = <INFILE>) {
    chomp $line; #Remove New Line
    if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
       push @param_names, trim($line);
    }
    elsif ($line =~ /^----/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
        $template[-1] = 'A*';
        my $data_start_pos = tell INFILE;
        last; #Reached start of data exit loop
    }
}

my $size = $#param_names+1;
my @getType = ((1) x $size);
my $template = "@template";
my @lineData;
my %dataHash;
my $lineCount = 0;
while ($line = <INFILE>) {
    if ($lineCount % 100000 == 0){
        print "On Line: ".$lineCount."\n";
    }
    if ($line =~ /^\d/) { 
        chomp($line);
        @lineData = unpack $template, $line;
        my ($inHeader, $headerIndex) = findStr($lineData[1], @param_names);
        if ($inHeader) { 
            push @{$dataHash{$lineData[1]}{time} }, $lineData[0];
            push @{$dataHash{$lineData[1]}{data} }, $lineData[3];
            if ($getType[$headerIndex]){ # Things that only need written once
                $dataHash{$lineData[1]}{type}  = $lineData[2];
                $getType[$headerIndex] = 0;
            }
        }
    }  
$lineCount ++; 
} # END WHILE <INFILE>
close(INFILE);

print Dumper \%dataHash;

#WRITE BINARY FILE and TOC FILE
my %convert = (TXT=>sub{pack 'A*', join "\n", @_}, D=>sub{pack 'd*', @_}, UI=>sub{pack 'L*', @_});

open my $binfile, '>:raw', $lineArg1.'.bin';
open my $tocfile, '>', $lineArg1.'.toc';

for my $param (@param_names){
    my $data = $dataHash{$param};
    my @toc_line = ($param, $data->{type}, tell $binfile );
    print {$binfile} $convert{D}->(@{$data->{time}});
    push @toc_line, tell $binfile;
    print {$binfile} $convert{$data->{type}}->(@{$data->{data}});
    push @toc_line, tell $binfile;
    print {$tocfile} join(',',@toc_line,''),"\n";
}

sub trim { #Trim leading and trailing white space
  my (@strings) = @_;
  foreach my $string (@strings) {
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    chomp ($string);
  } 
  return wantarray ? @strings : $strings[0];
} # END SUB

sub findStr { #Return TRUE if string is contained in array.
    my $searchStr = shift;
    my $i = 0;
    foreach ( @_ ) {
        if ($_ eq $searchStr){
            return (1,$i);
        }
    $i ++;
    }
    return (0,-1);
} # END SUB

The output is as follows:

$VAR1 = {
          'Param 1' => {
                         'time' => [
                                     '1.1',
                                     '3.2',
                                     '5.3'
                                   ],
                         'type' => 'UI',
                         'data' => [
                                     '5',
                                     '10',
                                     '15'
                                   ]
                       },
          'Param 2' => {
                         'time' => [
                                     '4.5',
                                     '6.121'
                                   ],
                         'type' => 'D',
                         'data' => [
                                     '2.1234',
                                     '3.1234'
                                   ]
                       },
          'Param 3' => {
                         'time' => [
                                     '2.23',
                                     '7.56'
                                   ],
                         'type' => 'TXT',
                         'data' => [
                                     'Some Text 1',
                                     'Some Text 2'
                                   ]
                       }
        };

Here is the output TOC File:

Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
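
As a quick sanity check of this layout (the real consumer would be an fread loop in Matlab), here is a rough Perl read-back sketch over the .toc/.bin pair produced above:

#!/usr/bin/perl
use strict;
use warnings;

my $base = $ARGV[0];   # same file name that was passed to the writer script
open my $toc, '<',     $base.'.toc' or die "Cannot open TOC: $!";
open my $bin, '<:raw', $base.'.bin' or die "Cannot open BIN: $!";

while (my $line = <$toc>) {
    chomp $line;
    my ($name, $type, $t_start, $t_end, $d_end) = split /,/, $line;

    seek $bin, $t_start, 0;
    read $bin, my $time_raw, $t_end - $t_start;
    read $bin, my $data_raw, $d_end - $t_end;

    my @time = unpack 'd*', $time_raw;   # timestamps were packed as doubles
    my @data;
    if    ($type eq 'TXT') { @data = split /\n/, unpack('A*', $data_raw) }
    elsif ($type eq 'D')   { @data = unpack 'd*', $data_raw }
    else                   { @data = unpack 'L*', $data_raw }   # UI

    print "$name ($type): ", scalar(@time), " samples\n";
}
close $toc;
close $bin;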

Thanks everyone for their help so far! This is an excellent resource!

EDIT: Added Binary & TOC file writing code.

Aero Engy
  • Try this for writing the binary, and toc `my %convert = (TXT=>sub{pack 'A*', join "\n", @_ },D=>sub{ pack 'd*', @_},UI=>sub{pack 'L*', @_}); open my $binfile, '>:raw', $lineArg1.'.bin'; open my $tocfile, '>', $lineArg1.'.toc'; for my $param (@param_names){ my $data = $dataHash{$param};my @toc_line = ($param, $data->{type}, tell $binfile ); print {$binfile} $convert{D}->(@{$data->{time}}); push @toc_line, tell $binfile; print {$binfile} $convert{$data->{type}}->(@{$data->{data}}); push @toc_line, tell $binfile; print {$tocfile} join(',',@toc_line,''),"\n"; }` – Brad Gilbert Dec 20 '11 at 21:12
  • @BradGilbert I added your code to write the binary & TOC files. It appears to be working correctly. Thanks! – Aero Engy Dec 21 '11 at 17:34