Possible Duplicate:
How can I get dynamically web content using Perl?

I've been reviewing ways to get JavaScript to execute on web pages from a script, but I don't fully understand some of the solutions.

I'm going to use TechCrunch as an example. If you check out an article on TechCrunch, you'll see that the top of each page has a widget showing how many tweets, likes, and comments that page has received. If I wanted to scrape this page and gather that information, is there a solution for this in Perl?

I've looked at WWW::Scripter and WWW::Mechanize::Plugin::JavaScript, but possibly I don't fully understand what they provide. Is there a way for me to pass in a URL and have it run the JavaScript on the page, as a browser would, without having to pass variables or do anything else special to get it to execute?

user985590
  • OR http://stackoverflow.com/questions/2655034/how-can-i-use-perl-to-grab-text-from-a-web-page-that-is-dynamically-generated-wi – epascarello Oct 15 '12 at 16:19
  • Note that if you do find a way of scraping TechCrunch's page and they catch you doing it, they will do everything they can, technologically if not legally, to put you out of business. – Paul Tomblin Oct 15 '12 at 16:22
  • Thanks for the other links. I'll look at those and close this question if need be. For some reason they didn't come up in my searches :(. @Paul I fully understand the legalities of scraping pages. I used TechCrunch as an example. Your answer doesn't really apply to what's being asked. – user985590 Oct 15 '12 at 17:15

1 Answer

This is difficult to do. You would essentially have to have your Perl code drive a full browser engine that loads and runs the desired page. Once you detect that the page has finished loading, you would have to reach into that browser engine to get access to the DOM (probably with injected JavaScript) and read the values out of the page. It is this complicated because the data you want is not present in the page's HTML; it is inserted into the page by JavaScript.

A more practical solution would be to reverse engineer where the page itself gets the data from (for example, by watching the requests the page makes in your browser's network inspector) and then make your own web calls from Perl on your server to fetch the data from the same place the page fetches it from.
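As a rough illustration of that second approach, here is a minimal sketch using only core Perl modules (HTTP::Tiny and JSON::PP). Everything site-specific in it is an assumption: the endpoint URL, the `url` query parameter, and the `tweets`/`likes`/`comments` field names are hypothetical placeholders, and you would replace them with whatever the page's own JavaScript actually requests.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# HYPOTHETICAL endpoint: substitute the URL you observe the page calling.
my $COUNTS_URL = 'https://example.com/api/share-counts';

# Decode a JSON response body and pick out the counters of interest.
# The field names are assumptions; missing fields default to 0.
sub extract_counts {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return {
        tweets   => $data->{tweets}   // 0,
        likes    => $data->{likes}    // 0,
        comments => $data->{comments} // 0,
    };
}

# Request the counters for one article URL and return them as a hashref.
sub fetch_counts {
    my ($article_url) = @_;
    my $http  = HTTP::Tiny->new(timeout => 10, agent => 'my-scraper/0.1');
    # 'url' is an assumed parameter name; www_form_urlencode escapes the value.
    my $query = $http->www_form_urlencode({ url => $article_url });
    my $res   = $http->get("$COUNTS_URL?$query");
    die "Request failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return extract_counts($res->{content});
}
```

You would then call something like `my $counts = fetch_counts($article_url);` and read `$counts->{tweets}`. Note that HTTP::Tiny needs IO::Socket::SSL installed to make https requests, and that the parsing is kept in its own function so it can be adjusted separately when the response format turns out to differ.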

In either case, if you are not using public, documented APIs, your method can break at any time if the host changes the way it gets the data.

jfriend00