
I'm using Java to parse HTML from an arbitrary website, say http://google.com for convenience. After parsing the HTML, I want to extract some of the data and show it on a display. The user will then enter a search term and press a button, which should execute the script behind the site's own "search" button. I want to do this with several sites, so an approach that only works with Google won't help me much.

ZimZim
  • So what if the button makes an AJAX call? - you'll run into the Same Origin Policy and it will break because the page expects to be on domain X and it's now proxied into domain Y. – Diodeus - James MacFarlane Mar 29 '12 at 18:55
  • I don't understand the question. A website has html--what does google vs. ?? have to do with the html from the website? How does what you display differ from, say, "view source"? – Phil Freihofner Mar 29 '12 at 18:57
  • I think he wants to show screen-scraped pages and have them behave as original pages. – Diodeus - James MacFarlane Mar 29 '12 at 18:58
  • Yes, like Diodeus said, but I want to be able to use scripts on that page. Like the Google search button, or the Stack Overflow vote button. For example, that I press a button in my own program that will actually click a vote button on this site (by executing the code behind that button). – ZimZim Mar 30 '12 at 08:55

2 Answers


Edit:

Ah, I see. You are asking how to call a remote web page from your code. There are a couple of ways you can do this:


Scraping websites is a difficult problem, and I have rarely found a single scraper that can handle multiple websites. A fully generic scraper is simply not feasible.

I would recommend writing a Java interface, something like HandleSearchPage. It would contain one method to scrape the page and extract some of the data, and another method to submit a search.

Then you can implement your scrapers for Google, Yahoo, etc. As to how to parse HTML and drive a web page, there are many other questions/answers on that specific topic.
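A minimal sketch of that interface idea. HandleSearchPage is the name from the answer; the method names (scrapePage, submitSearch) and the stub site class are illustrative assumptions, not real scraping code:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// One scraper implementation per site; the rest of the program
// only ever talks to this interface, never to site-specific code.
// Method names here are illustrative, not prescribed by the answer.
interface HandleSearchPage {
    // Fetch the site's page and extract the fields we care about.
    Map<String, String> scrapePage();

    // Submit a search term the way the site's own search form would,
    // returning the result titles/links.
    List<String> submitSearch(String term);
}

// Stub implementation for one site. Real code would fetch and parse
// http://google.com (e.g. with an HTML parser) instead of returning
// canned placeholder data.
class GoogleSearchPage implements HandleSearchPage {
    @Override
    public Map<String, String> scrapePage() {
        return Collections.singletonMap("title", "Google");
    }

    @Override
    public List<String> submitSearch(String term) {
        return Collections.singletonList("stub result for: " + term);
    }
}
```

A YahooSearchPage and so on would implement the same interface, so the display and button-handling code never has to change per site.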

Best of luck.

Gray
  • Oh nonononono, my bad, what I meant was: I need an explanation that will let me do it for any site I find, programmatically of course. I absolutely don't expect a single piece of Java code to be able to manipulate the scripts on every website, haha. I just don't want an explanation that I will only be able to use for google.com. And thanks, I'll look into your answer. EDIT: You gave me an explanation of how to parse HTML. Like I said, I already know how to parse HTML in several ways. What I need to do is EXECUTE scripts on an external website through my own code. – ZimZim Mar 29 '12 at 19:22

Sorry, I am not too sure what the question is. If you want to grab a web page from Java and then strip out the HTML data, that is a task you can fairly easily do yourself, or you can use something like Nutch. If you want to run the JavaScript inside a page from your Java code, you will need to look at something like Rhino.

Nutch will crawl the pages and update an index (usually Solr); you can then issue searches against that index and display the results.
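As a hedged sketch of the Rhino route: JavaScript engines plug into the standard javax.script API (Rhino ships a javax.script adapter, and Java 8's built-in Nashorn registers under "javascript" as well), so code like the following can evaluate a script pulled from a page. Whether an engine is actually present depends on the JVM and classpath, so the null case has to be handled:

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

class RunPageScript {

    // Evaluate a JavaScript snippet (e.g. extracted from a page's
    // <script> tag) and return the result, or null when no JS engine
    // is available on this JVM.
    static Object evalJs(String js) throws Exception {
        ScriptEngine engine =
                new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            return null; // no Rhino/Nashorn adapter on the classpath
        }
        return engine.eval(js);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evalJs("1 + 2"));
    }
}
```

Note that this only runs standalone JavaScript: a page script that expects a browser DOM (document, window, the click handlers behind a button) will fail unless those objects are also provided, which is where headless-browser tools come in.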

Symeon Breen
  • A good bit of this should be a comment, dude. In these cases I say something like "I'm not sure you are talking about XXXX." Then my answer. Then "If you were talking about something else, edit your question." – Gray Mar 29 '12 at 19:34
  • Thanks for the comment, Gray. I am a bit new on this site, TBH. How do I add a comment? I see on this thread there is a grey "add comment" link, but there is not one under the OP's post? EDIT: Ahh, I need 50 rep to add a comment. – Symeon Breen Mar 30 '12 at 11:35