Perl: Parsing AJAX loaded content

Question

This is an age-old question about Perl web scrapers since Web 2.0: they cannot parse dynamically loaded pages, because they need some sort of JavaScript engine to render the page. The issue is more involved than simply executing JavaScript, since Perl would also have to manage and maintain the DOM.

It seems WWW::Selenium and WWW::Mechanize::Firefox are able to accomplish this by using Firefox (or other browsers) to do the rendering for them. However, V8 has become so popular (as seen with Node.js) that I'm curious whether there are any new libraries that use it, or whether a browser-independent solution has since appeared that I'm not aware of.

I might usually consider this a closable question, but with so few results when Googling and on Stack Overflow, there shouldn't be too many solutions (if any).

Related (older) Questions:

I'm a little confused...what do you mean by "a browser-independent solution?" If you're scraping a webpage, there will necessarily be differences in the page depending on which browser you use to render it, whether that be Firefox or a headless browser like PhantomJS. Do you just mean you want a solution that doesn't require you to install Firefox? — ThisSuitIsBlackNot, Oct 05 '15 at 18:46
@ThisSuitIsBlackNot yes; technically speaking Mechanize is a browser. What I was trying to get at is the ability to render the result of JS operations, without using a third-party browser, or installing another browser binary. — vol7ron, Oct 05 '15 at 19:39
In other words, you want a pure-Perl Javascript engine? Maybe try [Javascript.pm](https://metacpan.org/pod/JavaScript), although that's not a browser; I think you would have to integrate it with Mechanize yourself. Everything else I can find requires an external binary: WWW::Mechanize::PhantomJS, for example, requires you to install PhantomJS; [JavaScript::V8](https://metacpan.org/pod/JavaScript::V8), also not a browser, requires you to install V8. — ThisSuitIsBlackNot, Oct 05 '15 at 20:29
@ThisSuitIsBlackNot yes. That is what did not exist years ago, but the hope is it could now, especially since new solutions are more immediate these days. The audience at SO may know about the less popular modules, in this case one that is a PP version. I wouldn't mind JavaScript::V8, except I don't think it is integrated into Mechanize. It's easier to separate the two. Using Mechanize with Firefox/PhantomJS feels like too much middleware. — vol7ron, Oct 05 '15 at 20:33
Haven't used them, but you may want to check out https://metacpan.org/pod/Test::Mojo::Role::Phantom and https://metacpan.org/pod/Mojo::Phantom — oalders, Oct 06 '15 at 16:20

score 0 · Answer 1 · answered Oct 05 '15 at 20:12

You mentioned Selenium, but there is a later module, Selenium::Remote::Driver, which works with a Selenium 2.0 hub.

I see you can also use it without a Selenium hub; its documentation has a "Without Standalone Server" section (I haven't used this part):

As of v0.25, it's possible to use this module without a standalone server - that is, you would not need the JRE or the JDK to run your Selenium tests. See Selenium::Chrome, Selenium::PhantomJS, and Selenium::Firefox for details. If you'd like additional browsers besides these, give us a holler over in Github.

PhantomJS may be of interest, as it is a headless browser.
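
For illustration, here is a minimal sketch of the standalone mode described in the quoted documentation, using Selenium::PhantomJS. It is untested here and rests on several assumptions: the `phantomjs` binary is installed and on the PATH, and the URL and the element id `results` are hypothetical placeholders rather than anything from the question.

```perl
use strict;
use warnings;
use Selenium::PhantomJS;

# Assumption: the phantomjs binary is installed and on the PATH.
# The module starts it directly, so no Selenium hub or JRE is involved.
my $driver = Selenium::PhantomJS->new;

# Wait up to 5 seconds (the value is in milliseconds) when looking up
# elements, so AJAX-loaded content has a chance to appear.
$driver->set_implicit_wait_timeout(5000);

# Hypothetical URL, for illustration only.
$driver->get('http://example.com/ajax-page');

# Hypothetical element id; find_element croaks if nothing matches.
my $element = $driver->find_element('results', 'id');
print $element->get_text, "\n";

# The page source reflects the DOM after the page's JavaScript has run,
# so it can be handed to an ordinary Perl HTML parser from here.
my $html = $driver->get_page_source;

# Stop the phantomjs process that the module started.
$driver->shutdown_binary;
```

Swapping in Selenium::Firefox or Selenium::Chrome should only change the `use` line and the constructor, though each still needs its own external browser binary, so this does not give the pure-Perl renderer the question asks for.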

This is probably not an answer, but it was too long for a comment.

KeepCalmAndCarryOn


This may be interesting. There's also a plugin for Mechanize: http://search.cpan.org/~corion/WWW-Mechanize-PhantomJS-0.02/lib/WWW/Mechanize/PhantomJS/Examples.pm Though, it seems like using Firefox as headless would be accomplishing the same thing. I guess I can't really avoid a 3rd party renderer, Mechanize doesn't seem to have anything more native (was hoping there would be a WWW::Mechanize::V8) – vol7ron Oct 05 '15 at 20:16
@vol7ron FWIW, I don't think WWW::Mechanize::PhantomJS is as well-supported as WWW::Mechanize::Firefox. I submitted a [bug report and patch](https://rt.cpan.org/Public/Bug/Display.html?id=100191) for a rendering issue almost a year ago but there's been no response. For that particular project I ended up just using PhantomJS instead of Perl, although I guess I could have used my patched version of the module, too. I agree, a pure-Perl Javascript-aware browser would be nice. – ThisSuitIsBlackNot Oct 05 '15 at 20:52
