I'm using Java to parse HTML from a website, let's say http://google.com for convenience. After parsing the HTML, I want to extract some of that data and show it on a display. After that, the user will enter a search term and press a button, and that button should execute the script behind the site's own "search" button. I want to do this with several sites, so an answer that only works with Google won't help me a lot.
-
So what if the button makes an AJAX call? You'll run into the Same-Origin Policy, and it will break because the page expects to be on domain X but is now proxied into domain Y. – Diodeus - James MacFarlane Mar 29 '12 at 18:55
-
I don't understand the question. A website has HTML; what does Google vs. any other site have to do with the HTML from the website? How does what you display differ from, say, "view source"? – Phil Freihofner Mar 29 '12 at 18:57
-
I think he wants to show screen-scraped pages and have them behave like the original pages. – Diodeus - James MacFarlane Mar 29 '12 at 18:58
-
Yes, like Diodeus said, but I want to be able to use scripts on that page, like the Google search button or the Stack Overflow vote button. For example, pressing a button in my own program would actually click a vote button on this site (by executing the code behind that button). – ZimZim Mar 30 '12 at 08:55
2 Answers
Edit:
Ah, I see. You are asking how to call a remote web page from your code? There are a couple of ways you can do this:
- You can do it "by hand" using the Java URL class.
- You could use the great Apache HttpClient library.
- Another possibility is a tool like HtmlUnit, which can also execute the JavaScript behind a page's buttons (see the sketch below).
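For example, here is a minimal HtmlUnit sketch of driving a search form, assuming an HtmlUnit 2.x release from roughly the era of this thread. The form and element names ("f", "q", "btnG") are assumptions about Google's markup at the time; inspect the real page to confirm them.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitSearchDemo {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        try {
            // Load the page; HtmlUnit runs its JavaScript like a browser would.
            HtmlPage page = client.getPage("http://google.com");

            // "f", "q" and "btnG" are assumed names for the search form,
            // text box and button on Google's page of that era.
            HtmlForm form = page.getFormByName("f");
            HtmlTextInput box = form.getInputByName("q");
            box.setValueAttribute("stack overflow");

            // click() executes whatever script sits behind the button
            // and returns the page that results from it.
            HtmlSubmitInput button = form.getInputByName("btnG");
            HtmlPage results = button.click();
            System.out.println(results.asText());
        } finally {
            client.closeAllWindows();
        }
    }
}
```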
Scraping websites is a difficult problem, and I have rarely found that a single scraper can handle multiple websites. A truly generic scraper is just not possible.
I would recommend writing a Java interface, something like HandleSearchPage. It would contain one method to scrape the page and extract some of the data, and another method to submit a search.
Then you can implement your scrapers for Google, Yahoo, etc. As to how to parse HTML and drive a web page, there are many other questions/answers on that specific topic.
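To make the idea concrete, a sketch of such an interface might look like this. Only the name HandleSearchPage comes from the recommendation above; the method names and signatures are illustrative assumptions:

```java
import java.util.List;

// Sketch of the suggested per-site interface. Only the name
// HandleSearchPage appears in the answer; the method names and
// signatures below are assumptions for illustration.
public interface HandleSearchPage {
    // Parse the raw HTML of a page and extract the pieces of data
    // you want to show on your display.
    List<String> scrapePage(String html);

    // Submit a search term the way the site's own search form would,
    // returning the HTML of the resulting page.
    String submitSearch(String term) throws Exception;
}
```

Each site then gets its own implementation (a hypothetical GoogleSearchPage, YahooSearchPage, and so on), and the rest of your program only talks to the interface.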
Best of luck.
-
Oh nonononono, my bad. What I meant was, I need an explanation that will let me do it for all sites I find, programmatically of course. I absolutely don't expect a single piece of Java code to be able to manipulate all scripts on every website, haha. I just don't want you to give me an explanation that I will only be able to use for google.com. And thanks, I'll look into your answer. EDIT: You gave me an explanation of how to parse HTML. Like I said, I already know how to parse HTML in several ways. What I need to do is EXECUTE scripts on an external website through my own code. – ZimZim Mar 29 '12 at 19:22
Sorry, I am not too sure what the question is. If you want to grab a web page from Java and then strip out the HTML data, that is a task you can fairly easily do, or you can use something like Nutch. If you want to run the JavaScript inside a page from your Java code, then you will need to look at something like Rhino.
Nutch will spider the pages and update a database (usually Solr); you can then issue searches against the database and display the results.
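For the Rhino route, here is a minimal sketch of evaluating JavaScript from Java. This only shows the basic embedding API; wiring a real page's DOM into the script engine is considerably more work:

```java
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoDemo {
    public static void main(String[] args) {
        // Attach a Rhino context to the current thread.
        Context cx = Context.enter();
        try {
            // Create a scope with the standard JavaScript globals.
            Scriptable scope = cx.initStandardObjects();
            // Evaluate a snippet of JavaScript and get the result back in Java.
            Object result = cx.evaluateString(
                    scope, "var x = 6 * 7; x;", "<demo>", 1, null);
            System.out.println(Context.toString(result)); // prints 42
        } finally {
            // Always release the context when done.
            Context.exit();
        }
    }
}
```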

-
A good bit of this should be a comment, dude. In these cases I say something like "I'm not sure you are talking about XXXX." Then my answer. Then "If you were talking about something else, edit your question." – Gray Mar 29 '12 at 19:34
-
Thanks for the comment, Gray. I am a bit new on this site, TBH. How do I add a comment? I see on this thread there is a grey Add Comment link, but there is not one under the OP's post? EDIT: Ahh, I need 50 rep to add a comment. – Symeon Breen Mar 30 '12 at 11:35