6

I am a beginner to crawling. I have a requirement to fetch the posts and comments from a link, and I want to automate this process. I considered using a web crawler and jsoup for this, but was told that web crawlers are mostly used for websites with greater depth.

Sample for a page: Jive community website

For this page, when I view the source of the page, I can see only the post and not the comments. I think this is because the comments are fetched through an AJAX call to the server.

Hence, when I use jsoup, it doesn't fetch the comments.

So how can I automate the process of fetching posts and comments?
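For reference, this is roughly how I am fetching the page with jsoup at the moment (the URL and selectors below are placeholders, not the real ones):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchPost {
    public static void main(String[] args) throws Exception {
        // jsoup only downloads and parses the static HTML returned by the server;
        // anything added later via JavaScript/AJAX (the comments) never shows up here.
        Document doc = Jsoup.connect("https://www.heylululemon.com/some-post") // placeholder URL
                .userAgent("Mozilla/5.0")
                .get();

        // The post body is in the static HTML (selector is a guess for a Jive page) ...
        System.out.println(doc.select("div.jive-content-body").text());
        // ... but the comment container stays empty, because it is filled in by AJAX.
        System.out.println("comments found: " + doc.select("div.jive-comment").size());
    }
}
```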

Adarsh Konchady
  • All the comments are loaded from the Jive database; there are no hidden links that return the comments as raw text. There should be a link the site calls to fetch the comments, but I searched and couldn't find it (if you know JavaScript, you may be able to work out where they call it from here: https://www.heylululemon.com/6.0.2.0/resources/scripts/gen/b0e45f40028721e48611c14803fef20d.js). Have you tried WebView capabilities? – ImGeorge Dec 17 '13 at 17:24
  • Possible duplicate of [Jsoup Java HTML parser : Executing javascript events](http://stackoverflow.com/questions/7344258/jsoup-java-html-parser-executing-javascript-events) – Pshemo Feb 23 '16 at 20:09

2 Answers

12

Jsoup is an HTML parser only. Unfortunately, it cannot parse any JavaScript/AJAX-generated content, since jsoup can't execute scripts.

The solution: use a library that can execute scripts.

Some examples I know of are HtmlUnit and Selenium.

If such a library doesn't support parsing or selectors itself, you can at least use it to get the rendered HTML out of the scripted page (which can then be parsed by jsoup), as sketched below.
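A minimal sketch of that approach with HtmlUnit, assuming a placeholder URL and a made-up `div.comment` selector: HtmlUnit loads the page and runs its JavaScript (so the AJAX-loaded comments end up in the DOM), and the resulting markup is handed to jsoup for the actual parsing.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlUnitThenJsoup {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);            // let HtmlUnit run the page's scripts
            client.getOptions().setThrowExceptionOnScriptError(false); // real-world pages often have noisy JS
            client.getOptions().setCssEnabled(false);

            HtmlPage page = client.getPage("https://example.com/some-discussion"); // placeholder URL
            client.waitForBackgroundJavaScript(10_000); // give the AJAX calls time to finish

            // Hand the fully rendered markup to jsoup and use its selectors from here on
            Document doc = Jsoup.parse(page.asXml());
            System.out.println("comments found: " + doc.select("div.comment").size()); // selector is a guess
        }
    }
}
```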

ollo
2

Jsoup does not handle JavaScript and AJAX, so you need to use HtmlUnit or Selenium. After loading the page with HtmlUnit or Selenium, you can use jsoup for the rest of the task.
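If you prefer driving a real browser, a rough equivalent with Selenium (here ChromeDriver in headless mode; the URL, the wait, and the `div.comment` selector are placeholders) could look like this: the browser renders the page including the AJAX-loaded comments, then jsoup parses the page source.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class SeleniumThenJsoup {
    public static void main(String[] args) throws InterruptedException {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // no visible browser window needed
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/some-discussion"); // placeholder URL
            Thread.sleep(5_000); // crude wait for the AJAX comments; a WebDriverWait would be cleaner

            // The browser has executed the JavaScript, so the comments are now in the DOM
            Document doc = Jsoup.parse(driver.getPageSource());
            for (Element comment : doc.select("div.comment")) { // selector is a placeholder
                System.out.println(comment.text());
            }
        } finally {
            driver.quit();
        }
    }
}
```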

Gaurab Pradhan