5

I am using XML::Twig to parse through a very large XML document. I want to split it into chunks based on the <change></change> tags.

Right now I have:

my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
$xml->parsefile($LOGFILE);

sub parseChange {

  my ($xml, $change) = @_;

  my $message = $change->first_child('message');
  my @lines   = $message->children_text('line');

  foreach (@lines) {
    if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
      print outputData "$_\n";
    }
  }

  outputData->flush();
  $change->purge;
}

Right now this runs the parseChange method as each block is pulled from the XML, and it is extremely slow. I tested it against reading the XML from a file with `$/` set to `"</change>"` and writing a function to return the contents of an XML tag, and that went much faster.
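
For reference, the record-separator version looked roughly like this (simplified sketch; the real tag-extraction function is more involved, and the regex is not real XML parsing):

{
    local $/ = '</change>';   # read one <change> ... </change> chunk at a time
    open my $fh, '<', $LOGFILE or die "Can't open $LOGFILE: $!";
    while (my $chunk = <$fh>) {
        while ($chunk =~ m{<line>([^<]*)</line>}g) {
            my $line = $1;
            print outputData "$line\n"
                if $line =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/;
        }
    }
}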

Is there something I'm missing or am I using XML::Twig incorrectly? I'm new to Perl.

EDIT: Here is an example change from the changes file. The file consists of many of these, one right after another, and there should not be anything in between them:

<change>
<project>device_common</project>
<commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
<tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>      
<parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>      
<author_name>Jean-Baptiste Queru</author_name>      
<author_e-mail>jbq@google.com</author_e-mail>      
<author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>      
<commiter_name>Jean-Baptiste Queru</commiter_name>      
<commiter_email>jbq@google.com</commiter_email>      
<committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>      
<subject>chmod the output scripts</subject>      
<message>         
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>      
</message>      
<target>         
    <line>generate-blob-scripts.sh</line>      
</target>   
</change>
user1897691
  • I don't think it is a good idea to pre-process the XML with a regex before passing it to `XML::Twig`. It makes your code a lot less robust. What if there is a `</change>` within a comment, for example? Also, it is unlikely that the XML parsing is the thing slowing down your script. Could you give more information: the size of the file and what kind of processing you are doing? – dan1111 Dec 12 '12 at 11:58
  • I'm not using regex anywhere at the moment. One method was using twig and the other was reading it in and parsing it myself. I extracted this piece from the overall script so it is the only thing that is running. Also, the file size is 2.3GB. I am extracting data from the XML and adding some of it to hashes. – user1897691 Dec 12 '12 at 12:00
  • Sorry, it was a mistake to say "regex". I meant that if you break up the file using some rule (such as the line separator) before parsing it, you might break the integrity of the XML. How big is your XML file? – dan1111 Dec 12 '12 at 12:02
  • The file is 2.3GB. It is a change log from a git repository, in XML format. – user1897691 Dec 12 '12 at 12:03
  • More information: It was sitting there parsing for at least an hour whereas the one where the line separator was used took about 15-20 minutes to do the same thing. – user1897691 Dec 12 '12 at 12:06
  • Ah yes, it is probably just using up your memory. See my answer below for how you can avoid loading it all into memory. – dan1111 Dec 12 '12 at 12:10
  • This implementation was actually very easy on the memory, whereas the line-separator implementation took up over 3GB of memory before terminating. – user1897691 Dec 12 '12 at 12:18
  • A 2GB XML file is unmanageable. Relational data of this size must be stored in a database to be accessed at any real speed. What is it that you need to do? If the XML is to have any purpose then it will be imported to a database at some stage. Attempting to change it in its serialised form is a bad idea. – Borodin Dec 12 '12 at 18:56
  • Of course a sequential read is *very* much faster. You are asking `XML::Twig` (via `XML::Parser`) to do all that a sequential read does, and in addition to build a parse tree from it and trigger callbacks on selected nodes. XML is a *sequential representation of non-sequential data*, and you must either tolerate the penalty for using XML or import it into a relational database before you manipulate it. – Borodin Dec 12 '12 at 19:06
  • @dan1111: Memory is not the immediate problem. An arbitrary sequential read is always going to be much faster than building a tree structure from any incoming data. – Borodin Dec 12 '12 at 19:13
  • I am just wondering why reading it in as plain text is so much faster, albeit more taxing on memory, than using an XML parser. I'm looking to extract data and perform statistical analysis; I don't wish to update or maintain it. – user1897691 Dec 13 '12 at 00:15

5 Answers

3

As it stands, your program is processing all of the XML document, including the data outside the change elements that you aren't interested in.

If you change the twig_handlers parameter in your constructor to twig_roots, then the tree structures will be built for only the elements of interest and the rest will be ignored.

my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });
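
(If you also needed to pass through the parts of the document outside those roots, `twig_roots` can be combined with XML::Twig's `twig_print_outside_roots` option; for pure extraction like this it isn't needed.)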
Borodin
  • I will try this but the document should just be a bunch of changes right after each other anyway. I have started running it and it looks to be about the same speed as before. – user1897691 Dec 12 '12 at 18:10
  • Then you should import your XML into [`SQLite`](https://metacpan.org/module/DBD::SQLite), work on it from there and export it afterwards. XML is not a random-access database format. – Borodin Dec 12 '12 at 19:10
1

XML::Twig includes a mechanism by which you can handle tags as they appear, then discard what you no longer need to free memory.

Here is an example taken from the documentation (which also has a lot more helpful information):

my $t= XML::Twig->new( twig_handlers => 
                          { section => \&section,
                            para   => sub { $_->set_tag( 'p'); }
                          },
                       );
  $t->parsefile( 'doc.xml');

  # the handler is called once a section is completely parsed, ie when 
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section 
    { my( $t, $section)= @_;      # arguments for all twig_handlers
      $section->set_tag( 'div');  # change the tag name
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->att( 'nb'); # get the attribute
      $title->prefix( "$nb - ");  # easy isn't it?
      $section->flush;            # outputs the section and frees memory
    }

This will probably be essential when working with a multi-gigabyte file, because (again, according to the documentation) storing the entire thing in memory can take as much as 10 times the size of the file.

Edit: A couple of comments based on your edited question. It is not clear exactly what is slowing you down without knowing more about your file structure, but here are a few things to try:

  • Flushing the output filehandle will slow you down if you are writing a lot of lines. Perl buffers file output specifically for performance reasons, and flushing bypasses that buffering.
  • Instead of using the (?i) mechanism, a rather advanced feature that probably carries a performance penalty, why not make the whole match case-insensitive? /[^a-z0-9]bug[^a-z0-9]/i is equivalent. You might also be able to simplify it to /\bbug\b/i, which is nearly equivalent; the differences are that \b treats the underscore as a word character and that it also matches at the very start or end of the string (see the quick check after this list).
  • There are a couple of other simplifications that can be made as well to remove intermediate steps.
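
A quick check of the three patterns side by side (illustrative sketch):

for my $s (' bug ', ' Bug ', '_bug_', 'debugger', 'bug at start') {
    printf "%-14s original:%d  /i:%d  \\b:%d\n", $s,
        ($s =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/ ? 1 : 0),
        ($s =~ /[^a-z0-9]bug[^a-z0-9]/i               ? 1 : 0),
        ($s =~ /\bbug\b/i                             ? 1 : 0);
}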

How does this handler code compare to yours speed-wise?

sub parseChange
{
    my ($xml, $change) = @_;

    foreach (grep /[^a-z0-9]bug[^a-z0-9]/i, $change->first_child('message')->children_text('line'))
    {
        print outputData "$_\n";
    }

    $change->purge;
}
dan1111
  • I did look at this a little although I must admit that I am a tad confused about the `para` line. I think this is what I'm doing. You can see in my sample code there that I did define a handler. – user1897691 Dec 12 '12 at 12:13
  • @user1897691, did you `flush` or `purge` to free memory within your handler? I'm not an expert on `XML::Twig`, but if you post the code of your handler someone might be able to help you more. – dan1111 Dec 12 '12 at 12:31
  • Ok I added it to my original question. I'm sure someone will comment about how file I/O is expensive, but it is being done in both versions of the code and I'm getting different timings, so file I/O is not the reason why one is running so much faster than the other. – user1897691 Dec 12 '12 at 12:34
  • @user1897691, updated with a few suggestions. If this doesn't help, providing some more information on your file might: how many change tags, how many lines within the section you are searching, how many times do you match a bug line and print it? etc. – dan1111 Dec 12 '12 at 13:09
  • @Borodin, this was not clear originally, as the OP had not shown the handler sub, so it wasn't clear that the tree was being purged or flushed. – dan1111 Dec 12 '12 at 14:45
  • I know that Perl caches writing which is why I had to manually flush there. I also happen to be writing a lot of data since the word bug appears in this file many times. Caching the entire thing was locking up my computer and crashing other applications. Also, while those regex changes may present a small speedup, it still would not explain why using `Twig` has been so much slower than doing the parsing myself. – user1897691 Dec 12 '12 at 18:13
  • @user1897691: `XML::Twig` is so much slower because it is analysing the entire document as an XML data tree. Reading random bytes with `$/` set to `</change>` is *not* "doing the parsing yourself". If you have much of this to do then you should import the data into a proper database. If it is a one-off that you are writing then just accept the hit. – Borodin Dec 12 '12 at 19:16
  • Is there an easy way to port the XML to an SQLite DB? – user1897691 Dec 13 '12 at 00:16
0

If your XML is really big, use XML::SAX. It doesn't have to load the entire data set into memory; instead, it reads the file sequentially and generates callback events for every tag. I have successfully used XML::SAX to parse XML files larger than 1GB. Here is an example of an XML::SAX handler for your data:

#!/usr/bin/env perl
package Change::Extractor;
use 5.010;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    bless { data => '', path => [] }, shift;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
    push @{$self->{path}} => $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;
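    # NB: smart match (~~) worked on the Perl 5.10+ of the time, but it has since been deprecated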
    if ($self->{path} ~~ [qw[change message line]]) {
        say $self->{data};
    }
    pop @{$self->{path}};
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use XML::SAX::PurePerl;

my $handler = Change::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_file(\*DATA);

__DATA__
<?xml version="1.0"?>
<change>
  <project>device_common</project>
  <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
  <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
  <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
  <author_name>Jean-Baptiste Queru</author_name>
  <author_e-mail>jbq@google.com</author_e-mail>
  <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
  <commiter_name>Jean-Baptiste Queru</commiter_name>
  <commiter_email>jbq@google.com</commiter_email>
  <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
  <subject>chmod the output scripts</subject>
  <message>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
  </message>
  <target>
    <line>generate-blob-scripts.sh</line>
  </target>
</change>

Outputs

Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f
creaktive
  • If this is quicker, then it will do the trick. However, I am also looking for other information from the XML besides the line you pulled in your example. How can I pull the data from a specific tag? – user1897691 Dec 13 '12 at 00:18
  • The provided example detects the tag via `if ($self->{path} ~~ [qw[change message line]]) { ... }` condition. So, to pick up an `author_name`, add a condition `$self->{path} ~~ [qw[change author_name]]`. – creaktive Dec 13 '12 at 10:27
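
    For example, an extended end_element following the same pattern (output labels are illustrative):

    sub end_element {
        my ($self, $el) = @_;
        if ($self->{path} ~~ [qw[change message line]]) {
            say "line: $self->{data}";
        } elsif ($self->{path} ~~ [qw[change author_name]]) {
            say "author: $self->{data}";
        }
        pop @{$self->{path}};
    }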
0

Not an XML::Twig answer, but ...

If you're going to extract stuff from XML files, you might want to consider XSLT. Using xsltproc (invocation shown after the stylesheet) and the following XSL stylesheet, I got the bug-containing change lines out of 1GB of <change>s in about a minute. Lots of improvements possible, I'm sure.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >

  <xsl:output method="text"/>
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/">
    <xsl:apply-templates select="changes/change/message/line"/>
  </xsl:template>

  <xsl:template match="line">
    <xsl:variable name="lower" select="translate(.,$uppercase,$lowercase)" />
    <xsl:if test="contains($lower,'bug')">
      <xsl:value-of select="."/>
      <xsl:text>
</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
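
Invocation is then a one-liner, e.g. `xsltproc bug-lines.xsl changes.xml > bug-lines.txt` (file names hypothetical).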

If your XML processing can be done as

  1. extract to plain text
  2. wrangle flattened text
  3. profit

then XSLT may be the tool for the first step in that process.
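
For step 2, the flattened output is then plain line-oriented text, e.g. (sketch, file name hypothetical):

open my $fh, '<', 'bug-lines.txt' or die "Can't open bug-lines.txt: $!";
my %count;
while (my $line = <$fh>) {
    chomp $line;
    $count{$line}++;                  # tally duplicate bug lines
}
print "$count{$_}\t$_\n" for sort { $count{$b} <=> $count{$a} } keys %count;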

0

Mine's taking a horrifically long time.

my $twig = XML::Twig->new(
    twig_handlers => {
        SchoolInfo => \&schoolinfo,
    },
    pretty_print => 'indented',
);

$twig->parsefile('data/SchoolInfos.2018-04-17.xml');

sub schoolinfo {
  my ($twig, $l) = @_;
  my $rec = {
    name  => $l->field('SchoolName'),
    refid => $l->{'att'}->{RefId},
    phone => $l->field('SchoolPhoneNumber'),
  };

  for my $node ($l->findnodes('//Street'))    { $rec->{street}   = $node->text; }
  for my $node ($l->findnodes('//Town'))      { $rec->{city}     = $node->text; }
  for my $node ($l->findnodes('//PostCode'))  { $rec->{postcode} = $node->text; }
  for my $node ($l->findnodes('//Latitude'))  { $rec->{lat}      = $node->text; }
  for my $node ($l->findnodes('//Longitude')) { $rec->{lng}      = $node->text; }

  # note: nothing here flushes or purges the twig, so the whole parsed
  # tree stays in memory for the life of the program
}

Is it the pretty_print perchance? Otherwise it's pretty straightforward.

Dave Hodgkinson