I use Jsoup to scrape the website:

doc = Jsoup.connect(String.valueOf(urls[0])).userAgent("Mozilla").get();    

Here is the link:

http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40

I added the rpp=40 parameter to the link to display 40 results per page, and I can see all of the results in the page's view source. I know that Jsoup handles static content only and cannot fetch content that a website generates with AJAX or JavaScript libraries. But why can't Jsoup retrieve the same content that I see in the browser's view source? View source shows 40 results, whereas Jsoup retrieves elements from only 10 results. How can I obtain every element that is visible in view source?

Marcin S.

1 Answer

Short answer: Jsoup can't execute JavaScript.

Long answer:

http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40

The page you are requesting accepts an HTTP GET with parameters. A normal browser sends those parameters and loads the page, but not with Willowbrook selected (in your example). The JavaScript runs after the page has loaded, and it is the JavaScript that ticks the filter checkbox and filters the search results. Therefore, when you use Jsoup you get more results, because it loads 'state=IL' without the 'willowbrook' filter applied.
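One related detail can be checked without touching the network: everything after the `#` in a URL is a fragment, and HTTP clients (Jsoup included) do not send fragments to the server. The sketch below uses only `java.net.URI` to split the link from the question. The class `FragmentCheck` and its methods are hypothetical helpers written for this illustration, not part of Jsoup.

```java
import java.net.URI;

// Hypothetical helper (not a Jsoup API): separates the part of a URL that is
// sent in the HTTP request from the fragment that stays on the client.
class FragmentCheck {

    // Everything before '#' is what the HTTP request is built from.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // The fragment is only visible to the browser and its scripts;
    // returns null when the URL has no fragment.
    static String fragment(String url) {
        return URI.create(url).getRawFragment();
    }

    public static void main(String[] args) {
        String url = "http://www.yelp.com/search?find_desc=restaurant"
                + "&find_loc=willowbrook%2C+IL&ns=1"
                + "#l=p:IL:Willowbrook::&sortby=rating&rpp=40";
        System.out.println("Sent to server: " + withoutFragment(url));
        System.out.println("Client-only:    " + fragment(url));
    }
}
```

For the link in the question, `sortby=rating` and `rpp=40` both fall inside the fragment, so they never reach the server and can only take effect if a script in the browser reads them. Whether the server would honor them as real query parameters is not something the question or this answer confirms.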

wtsang02
  • Thanks for the explanation. I get more results only when I use view source, but Jsoup shows only ten results. How can I execute JavaScript then? – Marcin S. Feb 17 '13 at 23:23
  • [You can't](http://stackoverflow.com/questions/7344258/jsoup-java-html-parser-executing-javascript-events). You may be getting fewer results because their application logic isn't what I described in my answer. But the idea is that the JavaScript manipulates the data after the page loads. – wtsang02 Feb 17 '13 at 23:50