Say someone else has a website whose content is generated by JavaScript, so I can't just view the source and read what should be on the screen. How can I grab the text on the screen so I can feed it into another program? Also, how can I write a program that automatically clicks on radio buttons, links, etc. that satisfy certain criteria?

John Saunders
i love stackoverflow
  • Do you need to _write_ the program? What if someone else has already written it, and will give it to you for free? – John Saunders Mar 14 '12 at 02:10
  • @JohnSaunders Well that's fine too :D – i love stackoverflow Mar 14 '12 at 02:13
  • possible duplicate of [What's a good tool to screen-scrape with Javascript support?](http://stackoverflow.com/questions/125177/whats-a-good-tool-to-screen-scrape-with-javascript-support) – John Saunders Mar 14 '12 at 02:31
  • In that case, this is a duplicate of http://stackoverflow.com/questions/125177/whats-a-good-tool-to-screen-scrape-with-javascript-support. It looks like there are good answers there. – John Saunders Mar 14 '12 at 02:32

3 Answers


You can write a web scraping tool in Perl or Python, or you can use existing tools and frameworks to do the job.

Check out Scrapy, an open-source tool written in Python.

Take a look at Selenium too.

torrential coding
  • This might shed some more light: http://stackoverflow.com/questions/125177/whats-a-good-tool-to-screen-scrape-with-javascript-support – torrential coding Mar 14 '12 at 02:08

To parse dynamic content, you could read the page's JavaScript source and fetch the same content the same way the page does (i.e., by replicating its Ajax calls and the like).
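As a rough sketch of what replicating an Ajax call might look like in Python, using only the standard library: the idea is to find the URL the page's script requests (for example in the browser's network tab) and request it yourself. The function names `fetch_json` and `collect_text`, the `text` field, and the assumption that the endpoint returns a JSON list are all illustrative, not taken from any particular site.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Request `url` the way a page's script might, and decode the JSON body."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Many Ajax libraries send this header; some servers check for it.
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def collect_text(payload, key="text"):
    """Pull the (hypothetical) `key` field out of each item in a JSON list."""
    return [item[key] for item in payload if key in item]
```

Usage would be something like `collect_text(fetch_json(endpoint_url))`, where `endpoint_url` is whatever address the page itself calls. Real endpoints may also need cookies or tokens that the page sends along.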

If you want to submit data as if an element had been clicked, edited, or selected (without actually clicking it), you could also send a request containing the data the server expects, using an HTTP library such as cURL. See an example here.
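The same idea can be sketched in Python instead of cURL, again with just the standard library. The endpoint URL and the field names `choice` and `comment` below are made up for illustration; the real ones would come from the page's form or from watching what the browser sends. The request is built but not sent, so the snippet runs without network access.

```python
import urllib.parse
import urllib.request


def build_form_post(url, fields):
    """Build (but do not send) a POST request that mimics an HTML form submit."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical endpoint and field names; read the real ones from the page's form.
request = build_form_post(
    "http://example.com/survey",
    {"choice": "b", "comment": "sent without clicking"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
# To actually send it: urllib.request.urlopen(request).read()
```

Selecting a radio button in a form usually amounts to sending that button's `name` and `value` pair, which is why a plain dictionary of fields is enough here.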

Nathan

If you need to handle content generated by script, then your first problem is to cause the script to execute. Further, the script will want to generate the content into a DOM. That means you need to have a DOM, and a script engine, and probably HTTP access to the Internet, and XML handling, etc.

If that sounds a lot like a web browser, then you're listening.

What you basically need is a web browser that you can control from a program. You'll need to be able to tell it to browse to a page, click buttons and links, etc., then you'll need to read back the resulting DOM.

Only then will you need to parse the page.
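For that last parsing step, here is a minimal Python sketch, assuming the controlled browser has already handed back the rendered DOM as an HTML string. It collects the visible text and picks out radio buttons that meet a criterion. The `PageScanner` class, the sample markup, and the "value equals pro" rule are all invented for illustration.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect visible text and radio-button attributes from rendered HTML."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.radios = []
        self._hidden_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("script", "style"):
            self._hidden_depth += 1
        elif tag == "input" and attributes.get("type") == "radio":
            self.radios.append(attributes)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._hidden_depth and data.strip():
            self.text_parts.append(data.strip())


# Stand-in for the DOM read back from the browser (hypothetical markup).
rendered_html = """
<html><body>
  <script>var ignored = 1;</script>
  <p>Pick a plan</p>
  <input type="radio" name="plan" value="free">
  <input type="radio" name="plan" value="pro">
</body></html>
"""

scanner = PageScanner()
scanner.feed(rendered_html)
scanner.close()
page_text = " ".join(scanner.text_parts)
wanted = [r for r in scanner.radios if r.get("value") == "pro"]
print(page_text)
print(wanted)
```

This only identifies which elements match; actually clicking them is still the job of whatever is driving the browser.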

If you're in the Microsoft world, then you can use the WebBrowser control. There are several forms of this, and they all amount to the same thing: you can have Internet Explorer run inside of your program, and your program can control it.

I understand there are other browsers that can be controlled from a program, but since I don't know their details, I'll wait for someone else to tell us both.

John Saunders