I am trying to scrape a webpage, but for some reason I can only traverse up to a certain point on the page. I know that some content is sometimes not captured because it is generated by JavaScript, so I printed the entire document to a text file to check. The element I need is there, which tells me the data was successfully fetched by Jsoup.

I've tried increasing the timeout and the max body size to make sure the response isn't being limited there.

Can anyone point out what I'm missing?

doc = Jsoup.connect("https://www.mississaugahardware.com/products?keyword=dcf680n1&mainc=")
            .header("Accept-Encoding", "gzip, deflate")
            .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
            .maxBodySize(0)
            .timeout(600000)
            .get();


    Elements info = doc.select("span[class=PriceListModeBig");

I was able to pull values for elements near the top of the page but not further down.

user818502
  • Your selector is wrong. To select by class, the usual form is `span.PriceListModeBig` (the attribute form you are using also works, but it's less common). Besides that, you are not closing the bracket. – Eric Martinez Jun 28 '15 at 00:26
  • I tried to crawl the website in your code example, I fixed the selector and I still got nothing. So I'm guessing that the website is loading the products asynchronously. If that's the case you won't be able to crawl it with jsoup. – Eric Martinez Jun 28 '15 at 00:47
  • @EricMartinez I had the same result. When I write the entire contents of doc to an output file, I can see the class and the value in it. Would that not mean that it is being loaded properly? – user818502 Jun 28 '15 at 00:50
  • I can't tell you for sure, it makes sense what you say but apparently it doesn't work that way. See this thread http://stackoverflow.com/questions/20633294/fetch-contentsloaded-through-ajax-call-of-a-web-page – Eric Martinez Jun 28 '15 at 01:03
  • please consider accepting my answer if your question is fully resolved. – luksch Jun 30 '15 at 09:14
  • @luksch Sorry for the delay, tried it and it worked great! Thanks – user818502 Jul 06 '15 at 22:00

1 Answer

Your request returns a document that contains this pseudo-HTML line:

<td><span class=&quot;PriceListModeBig&quot;>$99.00 CAD <span class=&quot;productitalic&quot;></span></td>

Note the &quot; in the line!

This is because the HTML you are trying to parse is actually the value attribute of an input element with the id dnn_ctr306650_ViewLayoutManager_SCESideMenu_2_hSearchResult. I am not sure whether that id is stable or changes between requests. It seemed stable while I was testing, but it may also depend on the input parameters of the request. I did not investigate this.
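To see why the selector comes back empty in this situation, here is a minimal offline sketch. The HTML string and the shortened id `hSearchResult` are made-up stand-ins for the real response, not copied from the site; the point is only that markup stored inside an attribute is text to the parser, not elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EscapedValueDemo {
    // Hypothetical stand-in for the server response: the span markup sits
    // inside the value attribute of an input, with its quotes escaped.
    static final String PAGE =
        "<html><body><input type=\"hidden\" id=\"hSearchResult\" "
        + "value=\"<span class=&quot;PriceListModeBig&quot;>$99.00 CAD</span>\">"
        + "</body></html>";

    // Counts span.PriceListModeBig elements in the outer document.
    static int spansInOuterDocument() {
        Document doc = Jsoup.parse(PAGE);
        return doc.select("span.PriceListModeBig").size();
    }

    // Returns the input's value attribute as decoded text.
    static String rawValue() {
        Document doc = Jsoup.parse(PAGE);
        Element input = doc.getElementById("hSearchResult");
        return input.attr("value");
    }

    public static void main(String[] args) {
        // No span elements exist in the outer document tree.
        System.out.println(spansInOuterDocument());
        // The markup is only available as a string on the input.
        System.out.println(rawValue());
    }
}
```

The span shows up when the document is dumped to a file because the attribute text is written out along with everything else, which matches what the question describes.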

Jsoup treats that attribute value as plain text rather than as markup, so a selector run against the outer document never sees the span. It is strange, of course, that the web server returns the results this way, but there it is. I solved this by getting the value of the input and parsing the result again with Jsoup:

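import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the page (maxBodySize(0) removes the body size limit).
Document doc = Jsoup.connect("https://www.mississaugahardware.com/products?keyword=dcf680n1&mainc=")
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0)
        .timeout(600000)
        .get();

// The search results are stored, escaped, in the value attribute of this input.
Element el = doc.select("#dnn_ctr306650_ViewLayoutManager_SCESideMenu_2_hSearchResult").first();
String innerHtml = el.attr("value");

// Parse that string as its own document, then select as usual.
Document docInner = Jsoup.parse(innerHtml);
Elements info = docInner.select("span.PriceListModeBig");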
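
If the numeric part of the id does turn out to vary between requests, one possible variation is to match the input by the end of its id and to guard against the element being absent. This is a sketch only: it assumes the `_hSearchResult` suffix stays constant, which was not verified against the live site, and the sample HTML, the id `dnn_ctr1_Demo_hSearchResult`, and the `extractPrices` helper are invented for illustration.

```java
import java.util.Collections;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SuffixSelectorDemo {
    // Hypothetical response: the id prefix here is made up.
    static final String SAMPLE =
        "<html><body><input type=\"hidden\" id=\"dnn_ctr1_Demo_hSearchResult\" "
        + "value=\"<span class=&quot;PriceListModeBig&quot;>$99.00 CAD</span>\">"
        + "</body></html>";

    // Hypothetical helper: finds the input by id suffix, re-parses its value,
    // and returns the text of every price span found inside it.
    static List<String> extractPrices(String outerHtml) {
        Document doc = Jsoup.parse(outerHtml);
        // [id$=...] matches elements whose id ends with the given text.
        Element input = doc.select("input[id$=_hSearchResult]").first();
        if (input == null) {
            return Collections.emptyList(); // no such input in this response
        }
        Document inner = Jsoup.parse(input.attr("value"));
        return inner.select("span.PriceListModeBig").eachText();
    }

    public static void main(String[] args) {
        System.out.println(extractPrices(SAMPLE));
    }
}
```

The null check matters because `first()` returns null when nothing matches, and calling `attr` on it would throw a NullPointerException.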
luksch