I have been using the Perl module XML::Twig for a while now, both to load entire XML files into memory before scanning them and with twig handlers. I have a very large XML file (usually hundreds of megabytes in size) stored in a zip archive. The file contains a single tag that I need to find. I use twig handlers to find the tag and extract one of its attributes. The tag sits near the top of the XML file.
The problem is that finding the tag and its attribute value can take a few minutes, because the handler scans the entire file. Is there a way to stop parsing once the match has been found (there will only ever be one) instead of scanning the rest of the file? Here is the code I use:
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
use XML::Twig;
use Data::Dumper;
my $File_To_Analyse='ZipFile1.zip';
my $start_Z=time;
my $zip = Archive::Zip->new();
$zip->read($File_To_Analyse);
my $XML_File = $zip->contents('XML_File.xml');
my $duration_Z = time - $start_Z;
my $duration_Split_Z=CalcTimeHMS($duration_Z);
print "Total Time for ZIP part:" . $duration_Split_Z . "\n";
#
#
#
#
my @Data=(-1);
my $start_T=time;
my $t1= XML::Twig->new(twig_roots =>
{'tagToLookFor' =>
sub {Get_Tag_Info(@_,\@Data);}})->parse($XML_File);
#
my $duration_T = time - $start_T;
my $duration_Split_T=CalcTimeHMS($duration_T);
print "Total Time for TWIG part:" . $duration_Split_T . "\n";
print Dumper \@Data;
#
#
#
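sub Get_Tag_Info {
    # $Data is the array reference passed in from the handler closure
    my ( $t, $elt, $Data ) = @_;
    my $Accession = $elt->att('attname');
    if ( defined $Accession ) {
        print "Res:" . $Accession . "\n";
        # Write through the reference rather than relying on the file-scoped @Data
        $Data->[0] = $Accession;
    }
    $elt->purge;
}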
I added the timings to check that reading the zip file was not taking too long; that part takes a couple of seconds on my system. The "Res" print statement follows a couple of seconds later, and then the program terminates after a further 2 minutes. This tells me that most of the time is spent scanning the file after I have found my match. Can I stop this?
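One possibility, going by the XML::Twig documentation, is to call `finish_now` on the twig from inside the handler, which is described as stopping processing without parsing the rest of the document. The following is only a minimal sketch under assumptions: the small in-memory document, the `filler` elements, and the `$filler_seen` counter are hypothetical stand-ins used to show whether anything after the match is still handled. It has not been timed against a multi-hundred-megabyte file.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for the real file: the tag of interest sits near the
# top, followed by a lot of content that is not needed.
my $xml = '<root><tagToLookFor attname="ABC123"/>'
        . ( '<filler>x</filler>' x 1000 )
        . '</root>';

my @Data        = (-1);
my $filler_seen = 0;     # counts filler elements handled after the match

my $twig = XML::Twig->new(
    twig_roots => {
        tagToLookFor => sub {
            my ( $t, $elt ) = @_;
            my $accession = $elt->att('attname');
            $Data[0] = $accession if defined $accession;
            # Stop here; the remainder of the document is not parsed,
            # so it is also not checked for well-formedness.
            $t->finish_now;
        },
        filler => sub { $filler_seen++ },
    },
);
$twig->parse($xml);

print "Res: $Data[0]\n";
print "filler elements handled after the match: $filler_seen\n";
```

Execution resumes after the `parse` call, so the timing and `Dumper` code from the original script would run as before. Note that `$zip->contents` still decompresses the whole member into memory first, so this only addresses the parsing time, not the unzip step.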