
I am parsing large XML files (60GB+) with XML::Twig and using it in an OO (Moose) script. I am using the twig_handlers option to process elements as soon as they're read into memory. However, I'm not sure how to handle the two arguments the handler receives: the element and the full twig.

Before I used Moose (and OO altogether), my script looked as follows (and worked):

my $twig = XML::Twig->new(
  twig_handlers => {
    $outer_tag => \&_process_tree,
  }
);
$twig->parsefile($input_file);


sub _process_tree {
  my ($fulltwig, $twig) = @_;

  $twig->cut;
  $fulltwig->purge;
  # Do stuff with twig
}

And now I'd do it like this.

my $twig = XML::Twig->new(
  twig_handlers => {
    $self->outer_tag => sub {
      $self->_process_tree($_);
    }
  }
);
$twig->parsefile($self->input_file);

sub _process_tree {
  my ($self, $twig) = @_;

  $twig->cut;
  # Do stuff with twig
  # But now the 'full twig' is not purged
}

The thing is that I now see that I am missing the purging of the full twig. I figured that, in the first, non-OO version, purging would help save memory by getting rid of the full twig as soon as I can. However, when using OO (and having to rely on an explicit sub {} inside the handler) I don't see how I can purge the full twig, because the documentation says that

$_ is also set to the element, so it is easy to write inline handlers like

para => sub { $_->set_tag( 'p'); }

So they talk about the Element you want to process, but not the fulltwig itself. So how can I delete that if it is not passed to the subroutine?

Bram Vanroy

1 Answer


The handler still gets the full twig, you're just not using it (using $_ instead).

As it turns out you can still call purge on the twig (which I usually call "element", or elt in the docs): $_->purge will work as expected, purging the full twig up to the current element in $_.

A cleaner (IMHO) way would be to actually get all of the parameters and purge the full twig explicitly:

my $twig = XML::Twig->new(
  twig_handlers => {
    $self->outer_tag => sub {
      $self->_process_tree(@_); # pass _all_ of the arguments
    }
  }
);
$twig->parsefile($self->input_file);

sub _process_tree {
  my ($self, $full_twig, $twig) = @_; # now you see them!

  $twig->cut;
  # Do stuff with twig
  $full_twig->purge;  # now you don't
}
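
For a version that can be run on its own, here is a minimal sketch of the same pattern. It is not the asker's actual class: the `Processor` package, its `seen` list, and the `run` method are hypothetical names, and it uses a plain blessed hash instead of Moose to keep dependencies down. The handler closure forwards `@_`, so the method receives both the full twig and the element.

```perl
use strict;
use warnings;
use XML::Twig;

package Processor;    # hypothetical class; plain bless stands in for Moose

sub new {
    my ($class, %args) = @_;
    return bless { outer_tag => $args{outer_tag}, seen => [] }, $class;
}

sub outer_tag { $_[0]{outer_tag} }
sub seen      { $_[0]{seen} }

sub run {
    my ($self, $xml) = @_;
    my $tag  = $self->outer_tag;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # The closure captures $self and forwards (full twig, element).
            $tag => sub { $self->_process_tree(@_) },
        },
    );
    $twig->parse($xml);    # parsefile($path) would be used for a real file
    return $self;
}

sub _process_tree {
    my ($self, $full_twig, $elt) = @_;
    push @{ $self->seen }, $elt->att('id');    # do stuff with the element
    $full_twig->purge;                         # then release what has been parsed so far
}

package main;

my $p = Processor->new(outer_tag => 'item')->run(
    '<root><item id="a">x</item><item id="b">y</item></root>'
);
print join(',', @{ $p->seen }), "\n";
```

Each `item` is handled and purged before the parser moves on to the next one, which is what keeps memory use flat on very large inputs.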
mirod
  • Aah, my bad! I should've inspected `@_` to see what was going on. Thanks! Is there any downside/upside of purging the full twig only *after* you have done stuff with the cut twig? My reasoning was to purge it immediately after cutting the *element*, so that memory is cleared as soon as possible. I might be wrong? Great module by the way, we use it **all** the time! – Bram Vanroy Jul 23 '17 at 10:58
  • It should make no difference when you purge. The most important thing is to reclaim the memory before you start parsing the next subtree. And thanks ;--) – mirod Jul 23 '17 at 11:18