Jsoup Not Traversing Entire HTML Page

Question

I am trying to scrape a webpage, but it appears that I can only traverse the page up to a certain point. I printed the entire doc to a file to make sure the element I need is there (I know some content is sometimes not captured because it is generated by JavaScript, etc.). In the resulting text file I was able to verify that Jsoup had captured the data I need.

I've tried increasing the timeout and the max body size to make sure the response isn't being limited there.

Can anyone point out what I'm missing?

doc = Jsoup.connect("https://www.mississaugahardware.com/products?keyword=dcf680n1&mainc=")
            .header("Accept-Encoding", "gzip, deflate")
            .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
            .maxBodySize(0)
            .timeout(600000)
            .get();


    Elements info = doc.select("span[class=PriceListModeBig");

I was able to pull values for elements near the top of the page but not further down.

Your selector is wrong... when you use classes (you can use the way you are doing it, but it's not the common way) you must use this `span.PriceListModeBig`. Besides that, you are not closing the bracket. — Eric Martinez, Jun 28 '15 at 00:26
I tried to crawl the website in your code example, I fixed the selector and I still got nothing. So I'm guessing that the website is loading the products asynchronously. If that's the case you won't be able to crawl it with jsoup. — Eric Martinez, Jun 28 '15 at 00:47
@EricMartinez I had the same result. When I send the entire value of doc to a output file, I was able to see the class and the value in it. Would that not mean that it is loading it properly? — user818502, Jun 28 '15 at 00:50
I can't tell you for sure, it makes sense what you say but apparently it doesn't work that way. See this thread http://stackoverflow.com/questions/20633294/fetch-contentsloaded-through-ajax-call-of-a-web-page — Eric Martinez, Jun 28 '15 at 01:03
please consider accepting my answer if your question is fully resolved. — luksch, Jun 30 '15 at 09:14
@luksch Sorry for the delay, tried it and it worked great! Thanks — user818502, Jul 06 '15 at 22:00

luksch · Accepted Answer · 2015-06-28T18:21:43.243

Your request returns a document that contains this pseudo-HTML line:

<td><span class=&quot;PriceListModeBig&quot;>$99.00 CAD <span class=&quot;productitalic&quot;></span></td>

Note the &quot; in the line!

This is because the HTML you are trying to parse is actually the value attribute of an input element with the id dnn_ctr306650_ViewLayoutManager_SCESideMenu_2_hSearchResult. I am not sure whether that id is stable or changes between requests. It seemed stable while I was testing, but it may also depend on the input parameters of the request. I did not investigate this.

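Jsoup does not interpret this too well, it seems: it treats the attribute value as plain text rather than parsing it as markup, so a selector run on the outer document never sees the span. It is strange, of course, that the web server returns such content, but there it is. I solved this by reading the value of the input and parsing the result again with Jsoup: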

doc = Jsoup.connect("https://www.mississaugahardware.com/products?keyword=dcf680n1&mainc=")
                .header("Accept-Encoding", "gzip, deflate")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
                .maxBodySize(0)
                .timeout(600000).get();

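// The product markup is stored in the value attribute of this input element
Element el = doc.select("#dnn_ctr306650_ViewLayoutManager_SCESideMenu_2_hSearchResult").first();
// attr() returns the attribute value with entities such as &quot; already decoded
String innerHtml = el.attr("value");
// Parse the extracted markup as a document of its own
Document docInner = Jsoup.parse(innerHtml);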

Elements info = docInner.select("span.PriceListModeBig");
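
To try the two-step parse without hitting the live site, here is a minimal offline sketch. The PAGE string is a made-up stand-in for the real response (an assumption, not the site's actual markup), and the class and method names are hypothetical. Because I did not check whether the id is stable, the sketch matches the input by its id suffix with Jsoup's [attr$=suffix] selector; that is also an assumption, and the suffix may not be stable either.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

public class NestedHtmlDemo {

    // Hypothetical sample page: the product markup sits, entity-escaped,
    // inside the value attribute of a hidden input element.
    static final String PAGE =
        "<html><body>"
        + "<input type=\"hidden\""
        + " id=\"dnn_ctr306650_ViewLayoutManager_SCESideMenu_2_hSearchResult\""
        + " value=\"&lt;div&gt;&lt;span class=&quot;PriceListModeBig&quot;&gt;"
        + "$99.00 CAD&lt;/span&gt;&lt;/div&gt;\">"
        + "</body></html>";

    // Returns the text of every price span hidden inside the input's value.
    static List<String> extractPrices(String html) {
        Document outer = Jsoup.parse(html);
        // Assumption: only the "_hSearchResult" suffix of the id is relied on.
        Element holder = outer.select("input[id$=_hSearchResult]").first();
        if (holder == null) {
            return List.of(); // no such input on the page
        }
        // attr() hands back the decoded value, i.e. real markup with < > and "
        Document inner = Jsoup.parse(holder.attr("value"));
        return inner.select("span.PriceListModeBig").eachText();
    }

    public static void main(String[] args) {
        // On the outer document the span is only attribute text, not an element.
        System.out.println(Jsoup.parse(PAGE).select("span.PriceListModeBig").size()); // prints 0
        // After re-parsing the attribute value, the span can be selected.
        System.out.println(extractPrices(PAGE)); // prints [$99.00 CAD]
    }
}
```

The null check matters in practice: if the input is missing (for example because the id changed), first() returns null and calling attr() on it would throw a NullPointerException.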
