Stop scanning XML file after first match found with Twig handler

Question

I have been using the Perl module XML::Twig for a while now, both to load entire XML files into memory before scanning them and with twig handlers. I have a very large XML file (usually hundreds of megabytes in size) stored in a zip archive. The file contains a single tag that I need to find. I use twig handlers to find the tag and extract one of its attributes. The tag sits near the top of the XML file.

The problem is that finding the tag and its attribute value can take a few minutes, because the handler scans the entire file. Is there a way to stop parsing once the match has been found (there will only ever be one), rather than continuing to scan the rest of the file? Here is the code I use:

use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
use XML::Twig;
use Data::Dumper;
my $File_To_Analyse='ZipFile1.zip';
my $start_Z=time;
my $zip = Archive::Zip->new();
$zip->read($File_To_Analyse);
my $XML_File = $zip->contents('XML_File.xml');
my $duration_Z = time - $start_Z;
my $duration_Split_Z=CalcTimeHMS($duration_Z);
print "Total Time for ZIP part:" . $duration_Split_Z . "\n";
#
#
#
#
my @Data=(-1);
my $start_T=time;
my $t1= XML::Twig->new(twig_roots => 
                      {'tagToLookFor' => 
                       sub {Get_Tag_Info(@_,\@Data);}})->parse($XML_File);
#
my $duration_T = time - $start_T;
my $duration_Split_T=CalcTimeHMS($duration_T);
print "Total Time for TWIG part:" . $duration_Split_T . "\n";
print Dumper \@Data;
#
#
#
sub Get_Tag_Info{
    my( $t, $elt, $Data)= @_;
    my $Accession = $elt->att('attname');
    if(defined $Accession){
        print "Res:" . $Accession . "\n";
        # Store via the array reference passed in, not the file-scoped @Data
        $Data->[0]=$Accession;
    }
    $elt->purge;
}

I added the timings to check that loading the zip file was not what takes so long; that part takes a couple of seconds on my system. The "Res" print statement follows a couple of seconds later, and then the program terminates after a further 2 minutes. This tells me that most of the time is spent scanning the file after I have already found my match. Can I stop this?

`Get_Tag_Info(@_\@Data)` Syntax error? And you have not defined `@Data`? — Håkon Hægland, Aug 02 '19 at 09:45
Have now edited post (thanks Håkon)...Apologies, have now looked at duplicate question which does indeed answer mine (thanks to simbabque) — Chazg76, Aug 05 '19 at 07:57
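
For readers who land here first: one way to stop early is to call XML::Twig's `finish_now` method from inside the handler, which abandons the parse and returns control to the statement after `parse`. The following is only a minimal sketch. The inline sample document and the `other` elements are hypothetical stand-ins for the real file, and `tagToLookFor` / `attname` are the placeholder names from the question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for the large XML file from the question.
my $xml = <<'XML';
<root>
  <tagToLookFor attname="ABC123"/>
  <other>1</other>
  <other>2</other>
</root>
XML

my $found;           # attribute value of the first (only) match
my $seen_after = 0;  # counts elements handled after the match, for illustration

my $twig = XML::Twig->new(
    twig_handlers => {
        tagToLookFor => sub {
            my ($t, $elt) = @_;
            $found = $elt->att('attname');
            # Stop parsing here; execution resumes after the parse() call.
            $t->finish_now;
        },
        # Only reached if parsing continues past the match.
        other => sub { $seen_after++ },
    },
);
$twig->parse($xml);

print "Res:$found\n" if defined $found;
print "Elements handled after the match: $seen_after\n";
```

Because parsing is abandoned at the match, anything after that point in the file is never checked for well-formedness, so this only suits cases like this one where the rest of the document is not needed.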
