Optimize Perl Script to Correlate Records Between Two Files

Question

open( FH, 'MAH' ) or die "$!";
while ( $lines = <FH> ) {
    $SSA = substr( $lines, 194, 9 );
    open( FH1, 'MAH2' ) or die "$!";
    while ( $array1 = <FH1> ) {
        @fieldnames = split( /\|/, $array1 );
        $SSA1       = $fieldnames[1];
        $report4    = $fieldnames[0];
        if ( $SSA =~ /$SSA1/ ) {
            $report5= $report4;
        }
    }
}

I am trying to extract the "SSA" value from MAH file and search in MAH2 file for that value. If found, return the "report4" value. I am able get the output but it is taking lot of time to process. Is there any way to optimize the code so it finishes quickly?

Each of my files has 300,000 records and file sizes are 15 MB. Currently it is taking 5 hours to process

Hi. For each line in the first file, you open and read completely the second file. Maybe you can read the first file and store its values. Do the same for the second file. Then do your comparison. It should go faster. — , Mar 05 '21 at 15:00
https://stackoverflow.com/questions/4371714/how-do-i-profile-my-perl-programs/4389536 may be useful for you — AKHolland, Mar 05 '21 at 15:11
We need more information to answer this - like what is the content of the two files. You'll optimise significantly by _not_ repeatedly re-opening and re-reading the second file though. — Sobrique, Mar 05 '21 at 15:25
300,000 * 300,000 = 90,000,000,000. The loop body is executed 90 billion times. So you are saying that the loop body is executed 5 million times per second, or once every 200 ns. That's pretty good. Microoptimizations (e.g. only reading the file once, replacing the regex match with a string comparison if possible) won't help that much. [continued] — ikegami, Mar 05 '21 at 16:04
To make this faster, you want to change from using a quadratic algorithm (O(N*M)) to using a linear one (O(N+M)), if possible. This would cut down the time to a tiny fraction of what you are using now. To see if that's possible, we need to know more about SSA and the comparison. Are you trying to the records in the two files that have the same SSA? And if so, is it a one-to-one, a one-to-many or a many-to-many relationship? — ikegami, Mar 05 '21 at 16:07
Also what is target for under which time the script should complete? — James Z, Mar 05 '21 at 16:10
Re "*Are you trying to the records*", Are you trying to *match* the records — ikegami, Mar 05 '21 at 16:13
Yes trying to match the records .SSA is unique number and we had only some of the values in MAH2 file and it is one to one relation — Akhil, Mar 05 '21 at 16:39

ikegami · Answer 1 · 2021-03-05T20:02:30.307

Build a lookup table.

my $foo_qfn = 'MAH';
my $bar_qfn = 'MAH2';

my %foos;
{
   open(my $fh, '<', $foo_qfn)
      or die("Can't open \"$foo_qfn\": $!\n");

   while ( my $foo_line = <$fh> ) {
      my $ssa = substr($foo_line, 194, 9);
      $foos{$ssa} = $foo_line;
   }
}

{
   open(my $fh, '<', $bar_qfn)
      or die("Can't open \"$bar_qfn\": $!\n");

   while ( my $bar_line = <$fh> ) {
      chomp($bar_line);
      my ($report4, $ssa) = split(/\|/, $bar_line);
      my $foo_line = $foos{$ssa};
      ...
   }
}

Your original code took time indirectly proportional to the number of foos times the number of bars (O(N*M)).

This will take time indirectly proportional to the largest of the number of foos and the number of bars (O(N+M)).

In other words, this should be over 100,000 times faster. We're talking seconds, not hours.

score 1 · Answer 2 · edited Mar 07 '21 at 23:13

If your task is just to find the records in file2 that correspond to records in file1 by the SSA field, there is another way to do it that might be faster and simpler than the classic lookup hash table approach.

You can use a regular expression constructed from the records in file1 to parse, match, and extract from file2 in one pass. Yes, Perl can handle regexes with 300,000 alternations! :) This is only reasonable in Perl's whose regex engines can construct alternation trees. (5.10+ You could use Regexp::Assemble before that.)

## YOUR CODE ##
open( FH, 'MAH' ) or die "$!";
while ( $lines = <FH> ) {
    $SSA = substr( $lines, 194, 9 );
    open( FH1, 'MAH2' ) or die "$!";
    while ( $array1 = <FH1> ) {
        @fieldnames = split( /\|/, $array1 );
        $SSA1       = $fieldnames[1];
        $report4    = $fieldnames[0];
        if ( $SSA =~ /$SSA1/ ) {
            $report5= $report4;
        }
    }
}

As regex:

our $file1 = "MAH";
our $file2 = "MAH2";

open our $fh1, "<", $file1 or die $!;
our $ssa_regex = "(?|" . 
    join( "|", 
      map join("", "^([^|]*)[|](", quotemeta($_), ")(?=[|])"), 
      map substr( $_, 194, 9 ), 
      <$fh1> ) .
    ")"
;
close $fh1;

open our $fh2, "<", $file2 or die $!;
our @ssa_matches = do { local $/; <$fh2> =~ m/$ssa_regex/mg; };
close $fh2;
undef $ssa_regex;
die "match array contains an odd number of entries??\n" if @ssa_matches % 2;

while (@ssa_matches) {
  my($report4, $SSA1) = splice @ssa_matches, 0, 2;
  ## do whatever with this information ##

}

Let's break that up with some comments.

Read file1 and build the regex.

our $file1 = "MAH";
our $file2 = "MAH2";

# open file1 as normal
open our $fh1, "<", $file1 or die $!;
# build up a regular expressions that will match all of the SSA fields
our $ssa_regex = 
   # Start the alternation reset group.  This way you always have $1 
   # and $2 regardless of how many groups or total parens there are.
   "(?|" . 
   # Join all the alternations together
    join( "|", 
      # Create one regex group that will match the beginning of the line, 
      # the first "record4" field, the | delimiter, the SSA, and then 
      # make sure the following character is the delimiter.  [|] is 
      # another way to escape the | character that can be more clear 
      # than \|.
      # Escape any weird characters in the SSA with quotemeta(). Omit 
      # this if plain text.
      map join("", "^([^|]*)[|](", quotemeta($_), ")(?=[|])"), 
      # Pull out the SSA value with substr().
      map substr( $_, 194, 9 ), 
      # Read all the lines of file1 and feed them into the map pipeline.
      <$fh1> ) .
    # Add the closing parethesis for the alternation reset group.
    ")"
;
# Close file1.
close $fh1;

Read in file2 and apply the regex.

# Open file2 as normal.
open our $fh2, "<", $file2 or die $!;
# Read all of file2 and apply the regex to get an array of the wanted
# "record4" field and the matching SSA.
our @ssa_matches = 
# Using a do{} block lets do the undef inline.
do { 
# Undefine $/ which is the input record seperator which will let 
# us read the entire file as a single string.
local $/; 
# Read the file as a single string and apply the regex, doing a global 
# multiline match.  /m means to apply the ^ assertion at every line, 
# not just at the beginning of the string.  /g means to perform and 
# return all of the matches at once.
<$fh2> =~ m/$ssa_regex/mg;
};
# Close file2 as normal.
close $fh2;
# Clear the memory for the regex if we don't need it anymore
undef $ssa_regex;

# Make sure we got pairs
die "match array contains an odd number of entries??\n" if @ssa_matches % 2;

# Now just iterate through @ssa_matches two at a time to do whatever
# you wanted to do with the matched SSA values and that "record4" 
# field.  Why is it record4 if it's the first field?
while (@ssa_matches) {
  # Use splice() to pull out and remove the two values from @ssa_matches
  my($report4, $SSA1) = splice @ssa_matches, 0, 2;
  ## do whatever with this information ##

}

The regex could be compacted a bit more if we're being pedantic.

our $ssa_regex = "^([^|]*)[|](" . 
    join( "|", 
      map quotemeta($_), 
      map substr( $_, 194, 9 ), 
      <$fh1> ) .
    ")(?=[|])"
;

I'm not guaranteeing this way is better or faster than any other, but it is a way to do it with fewer steps.

score 0 · Answer 3 · answered Mar 06 '21 at 21:51

ikegami already pointed out a much better approach of storing one file as a lookup table. But allow me to provide some observation of mine and perhaps those can be applicatble as well.

By this expression, we treat $SSA1 as a regular expression:

$SSA =~ /$SSA1/

I find it rare to store regular expressions in files... do you perhaps mean to do sub-string search instead of treating $SSA1 as a regular expression ? if that is the case this can probably be:

index($SSA, $SSA1) >= 0

OTOH in the same if-statement, the reaction after a successful match is:

$report5 = $report4

when there are multiple successful matches in the same inner-loop, the same statement is executed multiple times, which means $report5 stores something corresponding to the last match.

If at most only one match is expected from MAH2, perhaps adding a 'last' to leave the inner-loop.

if ( index($SSA, $SSA1) >= 0 ) {
    $report5 = $report4;
    last;
}

depending on where the match is in MAH2, this could cut some corners. Although, this stops the loop at first match instead of the last match ... which means it is not a directly replacement of yur original cod. If that is still a fit your purpose perhaps it can be used.

However, as the "output" of this piece of program, $report5 is used only once the given piece of code, meaning that for all the 9 billion iteartion we do, only one match really matter -- perhaps it also make sense to leave the outer loop (again, that might not be what you want.)

Re "*do you perhaps mean to do sub-string search instead of treating $SSA1 as a regular expression ?*", They clarified in the comments that they want numerical equality. — ikegami, Mar 07 '21 at 23:08

Optimize Perl Script to Correlate Records Between Two Files

3 Answers3