You want to process the large file just once, rather than once for each hit you are searching for. You can use awk, Python, Perl, etc., but whichever you pick, read your list of hits just once and read your large file just once.
Assuming the iteration elements cannot be nested, in Perl you can do something like this:
#!/usr/bin/perl -w
use strict; use English;
use IO::Handle;

sub run
{
    my $outputFile = 'output.xml';
    my $largeFile = 'largeInput.xml';
    open(my $hitsfile, '<', 'hits.txt') or die "could not open hits.txt: $!";
    my @hits = <$hitsfile>;
    close($hitsfile) or warn "error closing hits.txt: $!";
    chomp(@hits); # strip newlines so they do not end up inside the regex
    @hits = grep { length($_) } @hits; # skip blank lines, which would match everything
    my $hitsRegex = buildHitsRegex(@hits);
    my $iterationStopRegex = qr/<Iteration>/;
    open(my $infile, '<', $largeFile) or die "could not open $largeFile: $!";
    open(my $outfile, '>', $outputFile) or die "error opening $outputFile for writing: $!";
    $outfile->autoflush; # flush on every write so output shows up as it is written, not only after the handle is closed
    my $printlines = 0;
    while(my $line = <$infile>)
    {
        # start printing at the query-def line of an iteration we are looking for
        if(!$printlines && $line =~ $hitsRegex)
        {
            $printlines = 1;
        }
        if($printlines)
        {
            print $outfile $line;
        }
        # stop printing once the stop marker has been written
        if($printlines && $line =~ $iterationStopRegex)
        {
            $printlines = 0;
        }
    }
    close($infile) or warn "error closing $largeFile: $!";
    close($outfile) or warn "error closing $outputFile: $!";
}

# Build a single regex that matches the query-def tag followed by any of the hits.
sub buildHitsRegex
{
    my @hits = @ARG;
    my $firstHit = shift(@hits);
    # quotemeta treats each hit as a literal string, not as a pattern
    my $hitsRegexStr = '<Iteration_query-def>(?:' . quotemeta($firstHit);
    for my $hit (@hits)
    {
        $hitsRegexStr .= '|' . quotemeta($hit);
    }
    $hitsRegexStr .= ')';
    return qr/$hitsRegexStr/;
}

# Alternative to the combined regex: test the hits one at a time (not called by run).
sub matchesHit
{
    my ($line, $hits) = @ARG;
    my $iterationStartRegex = qr/<Iteration_query-def>/;
    for my $hit (@{$hits})
    {
        if($line =~ /$iterationStartRegex\Q$hit\E/)
        {
            return 1;
        }
    }
    return 0;
}

run();
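As a quick sanity check of the single-alternation idea, here is a minimal sketch that builds the combined regex from a small in-memory list and tries it against a few lines. The hit names and sample lines are made up for illustration, and quotemeta is used on the assumption that the hits are literal strings rather than patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical hit names; in the real script these are read from hits.txt.
my @hits = ("seq_1\n", "seq.2|alt\n");
chomp(@hits);    # strip trailing newlines so they do not end up in the regex

# Escape regex metacharacters (assumes literal hits) and join into one alternation.
my $alternation = join('|', map { quotemeta($_) } @hits);
my $hitsRegex   = qr/<Iteration_query-def>(?:$alternation)/;

# Made-up sample lines.
my @lines = (
    "  <Iteration_query-def>seq_1 some description</Iteration_query-def>\n",
    "  <Iteration_query-def>seqX2 other</Iteration_query-def>\n",
    "  <Iteration_query-def>seq.2|alt third</Iteration_query-def>\n",
);

# 1 where the line names one of the hits, 0 otherwise.
my @matched = map { $_ =~ $hitsRegex ? 1 : 0 } @lines;
print "@matched\n";    # prints "1 0 1"
```

Without quotemeta, the "." and "|" in the second hit would be treated as regex syntax and the second sample line would match as well. Note that the pattern only checks that the text after the tag starts with a hit, so seq_1 would also match a query named seq_10; add a delimiter after the group if that matters for your data.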
If each hit only needs to be matched once, you could also remove it from the list or regex after it matches, so that later lines are tested against fewer alternatives.
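One way to drop a hit once it has been matched is sketched below. This is only an illustration under some assumptions: the helper name buildRemainingRegex, the hit names, and the sample lines are all hypothetical, and the output-printing logic of the full script is left out. A capture group records which hit matched, the hit is deleted from a hash, and the regex is rebuilt from whatever is left.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one alternation regex from the hits that are still unmatched.
# The capture group records which hit matched. Returns undef when none remain.
sub buildRemainingRegex
{
    my ($remaining) = @_;
    return undef unless %{$remaining};
    my $alternation = join('|', map { quotemeta($_) } sort keys %{$remaining});
    return qr/<Iteration_query-def>($alternation)/;
}

# Made-up hits and input lines, for illustration only.
my %remaining = map { $_ => 1 } ('seq_1', 'seq_2');
my @lines = (
    "<Iteration_query-def>seq_2 first</Iteration_query-def>\n",
    "<Iteration_query-def>seq_2 duplicate</Iteration_query-def>\n",
    "<Iteration_query-def>seq_1 last</Iteration_query-def>\n",
    "<Iteration_query-def>seq_1 never examined</Iteration_query-def>\n",
);

my $hitsRegex = buildRemainingRegex(\%remaining);
my @found;
for my $line (@lines)
{
    last unless defined $hitsRegex;    # every hit has been seen; stop scanning
    if($line =~ $hitsRegex)
    {
        push(@found, $1);
        delete $remaining{$1};         # each hit is matched at most once
        $hitsRegex = buildRemainingRegex(\%remaining);
    }
}
print "@found\n";    # prints "seq_2 seq_1"
```

Rebuilding the regex after every match has its own cost, so with a very long hit list this may or may not be faster than leaving the regex alone; it is worth timing on real data. In the full script you would also want to finish printing the current iteration before leaving the loop.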
If <Iteration_query-def> is always at the beginning of the line, you can also optimize a bit by adding ^ to the beginning of the regex, which indicates that the line must begin with <Iteration_query-def>. For example:
my $hitsRegexStr = '^<Iteration_query-def>(?:' . $firstHit;
The same applies to <Iteration>:
my $iterationStopRegex = qr/^<Iteration>/;
If <Iteration> is always on its own line, you can also add a $ to match the end of the line:
my $iterationStopRegex = qr/^<Iteration>$/;
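To see what the anchors change in practice, the sketch below runs an unanchored, a fully anchored, and a whitespace-tolerant version of the stop regex over a few made-up lines. The ^\s* variant is not part of the suggestion above; it is included on the assumption that your file may indent its tags, in which case a bare ^ would never match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $unanchored = qr/<Iteration>/;
my $anchored   = qr/^<Iteration>$/;
my $indented   = qr/^\s*<Iteration>$/;    # assumption: tag may be preceded by whitespace

# Made-up sample lines, each with its trailing newline as read from a file.
my @lines = (
    "<Iteration>\n",                  # tag alone on the line
    "  <Iteration>\n",                # indented tag
    "<Iteration> trailing text\n",    # tag followed by other text
);

my (@unanchoredResults, @anchoredResults, @indentedResults);
for my $line (@lines)
{
    push(@unanchoredResults, $line =~ $unanchored ? 1 : 0);
    push(@anchoredResults,   $line =~ $anchored   ? 1 : 0);
    push(@indentedResults,   $line =~ $indented   ? 1 : 0);
}
print "unanchored: @unanchoredResults\n";    # prints "unanchored: 1 1 1"
print "anchored:   @anchoredResults\n";      # prints "anchored:   1 0 0"
print "indented:   @indentedResults\n";      # prints "indented:   1 1 0"
```

The $ anchor matches just before the trailing newline, so there is no need to chomp each line first.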