I have this Perl script, and it fails with an "Out of memory" error after a few minutes of running. I can't see any circular references and I can't work out why it is happening.

use feature 'say';
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use utf8;

$url = "some url";

my $mech = new WWW::Mechanize;
$mech->get($url);
my $html = HTML::TreeBuilder::XPath->new_from_content($mech->content);
my $html2;

do { 
    for $item ($html->findnodes('//li[@class="dataset-item"]'))
    {
        my $title = $item->findvalue('normalize-space(.//a[2])');
        next unless $title =~ /environmental impact statement/i;        
        my $link = $item->findvalue('.//a[2]/@href');
        $mech->get($link);
        $html2 = HTML::TreeBuilder::XPath->new_from_content($mech->content);
        my @pdflinks = $html2->findvalues('//a[@title="Go to external URL"]/@href');
        my $date = $html2->findvalue('//tr[th="Date Created"]/td');
        for $pdflink (@pdflinks)
        {
            next unless $pdflink =~ /\.pdf$/;
            $mech->get($pdflink);
            $mech->save_content($filename = $mech->response->filename);
            say "Title: $title\nDate: $date\nFilename: $filename\n";
        }
    }
    if ($nextpage = $html->findvalue('//ul[@class="pagination"]/li/a[.="»"]/@href'))
    {
        say "Next Page: $nextpage\n";
        $mech->get("some site" . $nextpage);
        $html = HTML::TreeBuilder::XPath->new_from_content($mech->content);
    }
} while ($nextpage);

say "Completed.";
brian d foy
CJ7
  • I have done this which seems to have helped: `my $mech = WWW::Mechanize->new(stack_depth=>0);` – CJ7 May 07 '20 at 07:40
  • After a few minutes I've accumulated around a GB of pdf files on disk (and perl's memory use keeps climbing) --- my guess: mech object is holding all that content as it keeps browsing. Testing... – zdim May 07 '20 at 08:05
  • @zdim I think my comment above has solved the issue. In the Mechanize docs it recommends that if your memory is being eaten up. – CJ7 May 07 '20 at 08:20

1 Answer

By default WWW::Mechanize has its user agent keep all history while browsing. From the documentation of its constructor:

  • stack_depth => $value

Sets the depth of the page stack that keeps track of all the downloaded pages. Default is effectively infinite stack size. If the stack is eating up your memory, then set this to a smaller number, say 5 or 10. Setting this to zero means Mech will keep no history.

Thus the object keeps growing. Tracking the size of $mech with total_size from Devel::Size shows that it grows by tens of kB after each pdf. And the script apparently gets a lot of matches; I quit my test when it had gobbled up 10% of memory (and had many dozens of files, with over a GB on disk).
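A rough sketch of that kind of measurement, assuming Devel::Size is installed. The report_size helper and the %agent stand-in are made up for illustration; in the actual script one would pass $mech to total_size after each download instead.

```perl
use strict;
use warnings;
use feature 'say';
use Devel::Size qw(total_size);

# Hypothetical helper: print and return the current size of a structure,
# counting everything reachable from the reference
sub report_size {
    my ($label, $ref) = @_;
    my $bytes = total_size($ref);
    say "$label: $bytes bytes";
    return $bytes;
}

# Stand-in for an agent object that keeps every page it has fetched
my %agent = ( history => [] );

my $before = report_size('start', \%agent);

# Three fake "pages" of 50,000 characters each
push @{ $agent{history} }, 'x' x 50_000 for 1 .. 3;

my $after = report_size('after 3 pages', \%agent);
```

A size that keeps climbing between iterations, as here, points at accumulated data rather than a leak in the usual sense.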

One solution then is to instantiate a new object for, say, each $item. That is wasteful in principle, but in practice it adds little overhead and it caps the maximum size.

Alternatively, reset the object, or indeed limit its stack depth. Since the code doesn't seem to need to go back to previous states at all, there is really no need for any stack, so your solution of dropping it (stack_depth => 0) is quite fine.
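For reference, a minimal sketch of the stack-depth option, using the constructor argument quoted above and the stack_depth accessor. It only builds the object and makes no requests.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# stack_depth => 0 means Mech keeps no page history at all,
# so earlier responses are not held in memory as browsing continues
my $mech = WWW::Mechanize->new( stack_depth => 0 );

# The depth can also be read or changed on an existing object;
# here a second object is limited to the last 5 pages
my $mech_short = WWW::Mechanize->new;
$mech_short->stack_depth(5);
```

With a depth of zero, going back to a previous page is no longer possible, which doesn't matter for a script that only moves forward.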

Comments

  • To be precise, there is no "leak" in the script; it just takes more and more memory

  • Always have use strict; and use warnings; at the top of a script

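  • It's better to not use indirect object syntax to instantiate an object (new Package), but rather a normal method call (Package->new), to avoid dealing with ambiguities in some cases. See the explanation in the [docs](https://perldoc.perl.org/perlobj.html#Invoking-Class-Methods) and on [this page](https://stackoverflow.com/q/32955172/4653379), and examples of trouble in [this post](https://stackoverflow.com/q/11695110/4653379) and [this post](https://stackoverflow.com/q/57989765/4653379).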
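To illustrate the last two points, here is a small self-contained sketch. The Counter class is made up for the example and has nothing to do with the script in the question.

```perl
use strict;     # catches undeclared variables such as a bare $url
use warnings;   # reports suspicious constructs at run time

package Counter;

# Plain constructor; 'start' is an optional named argument
sub new {
    my ($class, %args) = @_;
    return bless { count => $args{start} // 0 }, $class;
}

sub count { return $_[0]{count} }

package main;

# Normal method call: the class and the method are unambiguous
my $counter = Counter->new( start => 5 );

# The indirect form would be written
#     my $counter = new Counter( start => 5 );
# and usually behaves the same, but the parser has to guess
# whether "new" is a method or a function
```

Under strict, the `$url = "some url";` line from the question would fail to compile until it is declared with `my`.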

zdim
  • (A note about testing I mention: The initial version of the question provided the URL) – zdim May 07 '20 at 08:55
  • It does need `use utf8` because of the `»` characters. – CJ7 May 07 '20 at 10:52
  • And there's no harm in using `use utf8;` even if only characters from the ASCII character set are used. – ikegami May 07 '20 at 18:27
  • @ikegami Sure, i was commenting on an unneeded (I thought) pragma, as a matter of principle. – zdim May 07 '20 at 18:31
  • @ikegami (Strangely, the script runs just fine without the pragma ...?) – zdim May 07 '20 at 18:31
  • @ikegami must be me somehow ... this: `perl -wE'@u = qw(Æ &); say "@u"'` isn't supposed to work, right? It prints those back at me w/o fuss.. – zdim May 07 '20 at 18:39
  • That's different. Bytes >=0x80 in string literals will result in that character in the string. So you are ending up putting encoded text into `@u`, and outputting it without re-encoding it. In general, it's better to decode inputs and encode outputs rather than working with encoded text throughout your program, but it's not technically a bug to work with encoded text. – ikegami May 07 '20 at 18:44
  • @ikegami oh, right, thank you. I thought that something in the program would choke on those bytes (perhaps it does, resulting in different operation) – zdim May 07 '20 at 18:51
  • hmm. [No apparent instance of The Unicode Bug](https://pastebin.com/E1ZX57Jf) Are you sure the file was encoded using UTF-8 when you didn't include `use utf8;`? That doesn't work for me. – ikegami May 07 '20 at 18:52
  • @ikegami "_file was encoded using UTF-8_" -- what do you mean? I take the posted code and remove `use utf8;` (it had the URL, which I can provide if that were to help), and it "works." (Again, perhaps incorrectly since those `»` are in search strings...). That's what I meant -- did my words imply something else? – zdim May 07 '20 at 19:35
  • You said the program worked without `use utf8;`, but I believe otherwise. If you left out `use utf8;`, and if the source code was encoded using UTF-8, the program would not have worked. (It would not have found the correct elements.) – ikegami May 07 '20 at 19:42
  • @ikegami If you mean the script itself: `: Perl script, UTF-8 Unicode text executable` – zdim May 07 '20 at 19:45
  • @ikegami OK, by "_would not have worked_" we mean that it wouldn't find those things (but that it runs). That's possible to have happened and not easy to assert: these characters in question are a part of an XPath expression in a separate `if` -- there's plenty to find otherwise (as the script does). I expected an actual problem, too (added `warnings` to my runs), but it ran nicely; just probably incorrectly. That'd be a "nice" bug. – zdim May 07 '20 at 19:48
  • @ikegami (hm, no, if that `findvalue` fails the show stops since `$next_page` is `undef`/false ... perhaps when I quit my run it was still in the first `for` loop) – zdim May 07 '20 at 19:53
  • Feel free to test using [this](https://pastebin.com/rTiJFxaE) – ikegami May 07 '20 at 19:58
  • @ikegami Confirmed: when I let the program run w/o `use utf8` it actually finishes, never getting to 'next' page. With `utf8` I have to stop it, and by then there's 'next' and many more files, as in my original test. That's a really "good" bug, very possible --- start w/o utf8 and thus no pragma and then later copy-paste some patterns from somewhere and not notice funny chars – zdim May 07 '20 at 20:20
  • @ikegami Thank you for the test-script, it's great :) (as well as the previous one). And works as expected – zdim May 07 '20 at 20:20
  • And that's why it's not necessarily a bad idea to always use `use utf8;` :) Well, until you accidentally save the file as cp1252. – ikegami May 07 '20 at 20:23
  • In my opinion I don't think `Mechanize` should default to infinite stack depth. You would expect a user agent to function properly over a long period of time, just like any browser does. – CJ7 May 22 '20 at 04:23
  • @CJ7 Agreed. For one, it could have an "adaptive" depth as a default, whereby it would adjust if memory usage goes too far. That wouldn't be hard to set up. – zdim May 22 '20 at 04:32
  • @zdim What is the problem in using "indirect object syntax" to create an object? – CJ7 Jul 21 '20 at 05:52
  • @CJ7 It _may_ lead the interpreter to do unexpected things with some code as it's ambiguous. In most cases, and like in your example, it's fine. But since it can be a problem it's just better to use the normal method call. More specific explanation: [this page](https://stackoverflow.com/q/32955172/4653379) and [docs themselves](https://perldoc.perl.org/perlobj.html#Invoking-Class-Methods). Great examples of trouble: [this post](https://stackoverflow.com/q/11695110/4653379) and [this post](https://stackoverflow.com/q/57989765/4653379). – zdim Jul 21 '20 at 07:40
  • @CJ7 Expanded that comment in the answer – zdim Jul 21 '20 at 07:48