
I'm trying to rip the HTML page source of a website to get an email address. When I run the ripper/dumper or whatever you want to call it, it gets the source code but stops at line 160. If I manually go to the web page, right click, and choose View Page Source, I can parse the text, and the entire source is a little over 200 lines. The only problem with manually going to each page and right clicking is that there are over 100k pages, so it's going to take a while.

Here's the code I'm using to get the page source:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public static void main(String[] args) throws IOException, InterruptedException {
        URL url = new URL("http://www.runelocus.com/forums/member.php?102786-wapetdxzdk&tab=aboutme#aboutme");
        URLConnection connection = url.openConnection();

        connection.setDoInput(true);
        InputStream inStream = connection.getInputStream();
        BufferedReader input = new BufferedReader(new InputStreamReader(inStream));

        // Read the response line by line and concatenate it into one string
        String html = "";
        String line = "";
        while ((line = input.readLine()) != null)
            html += line;
        System.out.println(html);
    }

4 Answers


If you are trying to scrape the content of an HTML page, you shouldn't be using raw connections like that. Use an existing library: HtmlUnit is a very common one.

You pass in the URL and it gives you an object representing the page, and you get all the HTML markup as objects (e.g. you get a Div object for <div> elements, an HTMLAnchor object for <a> elements, etc.). Using an existing framework like HtmlUnit and reading the content of the page through it will make your life a lot easier.

You can also do searches (e.g. by element id, by tag name, by attribute, etc.), which makes jumping around the document easier given a pre-determined page markup.

You can also simulate clicking and other interactions as you need to.
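For illustration, here is a minimal sketch of that approach, assuming a recent HtmlUnit version; the class name PageDump is made up, the URL is the one from the question, and enabling JavaScript is an assumption about what the page needs:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class PageDump {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            // Let HtmlUnit execute the page's JavaScript before we read the markup
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage(
                    "http://www.runelocus.com/forums/member.php?102786-wapetdxzdk&tab=aboutme#aboutme");
            // Full serialized markup, after any scripts have run
            System.out.println(page.asXml());
            // Typed access to elements, e.g. every <a> on the page
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println(anchor.getHrefAttribute());
            }
            webClient.close();
        }
    }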

– TS-
  • I looked through that page and all their tutorials but couldn't find what I needed. I'm not sure you completely understood what I was asking: I already have a script that parses the emails out of the web page source. All I need is a script that can get the entire page source, as opposed to just the first 160 lines. But thank you for that link, it might come in handy. – Justin Beast Jul 09 '12 at 15:13
  • I would highly recommend you use HtmlUnit rather than reading raw HTML the way you do. But your issue might have to do with a timeout on the URLConnection; see if you can extend it (see http://stackoverflow.com/questions/3163693/java-urlconnection-timeout, http://stackoverflow.com/questions/2799938/httpurlconnection-timeout-question, and other sources about timeouts; a sketch of that tweak follows this comment thread). – TS- Jul 09 '12 at 15:17
  • I've looked at a lot of things online and I don't think the problem has anything to do with a connection timeout. The way I'm doing it should work; it just doesn't get the rest of the page source. – Justin Beast Jul 09 '12 at 15:42
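For reference, the timeout tweak mentioned above would be a small change to the question's code. A sketch, with arbitrary placeholder values:

    URLConnection connection = url.openConnection();
    connection.setConnectTimeout(10000); // ms allowed to establish the connection
    connection.setReadTimeout(30000);    // ms allowed to wait for data mid-read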

I ran your code and it seems to retrieve all the HTML, including the closing </html> tag.

Have you considered the possibility that you might have to be logged in on the website to see more? In that case, a library like the one user tsOverflow suggests might be helpful.

– reus
  • To get the info from that page you don't have to be a member or logged in, although I did try that, to no avail. – Justin Beast Jul 09 '12 at 15:44
  • Does the output of your program stop abruptly, or do you see the closing tag? Maybe there is some JavaScript that extends the DOM of the page. – reus Jul 09 '12 at 17:02
  • Yes, I realized that the script stops as it hits the tag, but the source code that I need is past the tag. Do you know of any way to grab it? – Justin Beast Jul 09 '12 at 17:05
  • I mean the closing </html> tag at the very end of the source. If you see this, you probably did get the whole document. – reus Jul 09 '12 at 17:09
  • Alright, yeah, I just looked back at the output the script gave me, and it does get the source; it just skips a bunch of code. This is a pic of the code it's skipping; the highlighted part is the code I would like to grab: http://img801.imageshack.us/img801/8053/614f3e711597486f8367d23.png – Justin Beast Jul 09 '12 at 17:16
  • Maybe it's the encoding of the page; I posted another answer. – reus Jul 09 '12 at 17:23

Upon looking at this, my best guess is that your while loop conditional is bad. I'm unfamiliar with the syntax you're using. Mind you, I haven't used Java in a while, but I feel like it should read...

String line = input.readLine();
while(line != null)
{
    html += line; //should use a StringBuilder here for optimization
    line = input.readLine();
}

I did note the StringBuilder optimization in the comment above. Also, I think this would be easier using the Scanner class.
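For illustration, a minimal sketch of that Scanner variant, assuming the same inStream as in the question (it needs java.util.Scanner imported; the ISO-8859-1 charset is taken from the page encoding mentioned in another answer):

    // Read the whole response with Scanner instead of BufferedReader
    Scanner scanner = new Scanner(inStream, "ISO-8859-1");
    StringBuilder html = new StringBuilder();
    while (scanner.hasNextLine()) {
        html.append(scanner.nextLine()).append('\n'); // keep the line breaks
    }
    scanner.close();
    System.out.println(html);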

  • I have tried using StringBuilder, StringBuffer, ArrayLists, HashLists, and everything else I can think of; there is no problem with the amount of memory the string is holding. I have already tested it with a larger amount of code and it worked fine. I just can't figure out why it decides to stop at a certain point. – Justin Beast Jul 09 '12 at 16:24
  • Right, that was more of an aside. The only thing that I can think of is that your JVM is out of date. That loop looks weird to me, like I said... so it could be evaluating that strangely. – Austin DeVinney Jul 09 '12 at 18:05

Maybe it helps to open the InputStreamReader with a different charset? Looking at the page you mention, the charset is ISO-8859-1:

BufferedReader input = 
    new BufferedReader(new InputStreamReader(inStream, "ISO-8859-1"));
– reus
  • I just tried that encoding, to no avail. There has to be a way to read the JavaScript. If you haven't figured it out yet, I don't really know anything about HTML or JavaScript, and I'm not the greatest in Java. – Justin Beast Jul 09 '12 at 17:26
  • Maybe you can look with Firebug (a Firefox add-on), using its Net panel, to see if extra requests are being made when the page is opened. If the info you are looking for is in such a request's response, then open a connection with Java to the URL concerned. – reus Jul 09 '12 at 17:38
  • I'm not exactly sure what you are suggesting. What I need this to do is "crawl" the page and get all the source code by itself, which it does; I'm just not sure why it's skipping all the JavaScript on the page. – Justin Beast Jul 09 '12 at 17:45
  • Perhaps the e-mail address you are looking for is written into the page dynamically with JavaScript (to prevent crawling). In that case, reading the source with Java like you did won't suffice; the JavaScript needs to be interpreted. Maybe it's possible with the library that @tsOverflow suggested. – reus Jul 09 '12 at 17:59