I am working on Android and using Jsoup to crawl some data from the internet. I am unable to find the exact class name of the element that holds the comments on the page used in the code below. Going by the page source of the URL, I tried disqus_thread, dsq-content, ul-dsq-comments and dsq-comment-body, but none of them returned the comments.

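import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CommentCrawler {
    public static void main(String[] args) {
        String url = "http://blogs.tribune.com.pk/story/39090/i-hate-materialistic-people-beta-but-i-love-my-designer-clothes/";
        // Start with an empty result so the loop below is safe even if the request fails.
        Elements lin = new Elements();
        try {
            Document d = Jsoup.connect(url).timeout(20 * 1000).userAgent("Chrome").get();
            // Look up elements by the class name taken from the page source.
            lin = d.getElementsByClass("dsq-comment-body");
            System.out.println(lin);
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Print the text of every matched element with its index.
        int i = 0;
        for (Element l : lin) {
            System.out.println(i + " : " + l.text());
            i++;
        }
    }
}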
waqas

1 Answer

That's because the HTML that makes up the comments is generated dynamically with JavaScript after the page has loaded. Jsoup only parses the HTML that the server returns and does not execute scripts, so at the time it reads the page the comment HTML doesn't exist yet and cannot be retrieved.
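
One way to see this for yourself is to search the raw HTML for the class name you are after. The sketch below is only an illustration: `staticHtml` is simplified stand-in markup that I made up (not the real page source), and `hasClass` is a hypothetical helper that uses a plain regular expression rather than a real HTML parser.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticHtmlCheck {
    // Matches class="..." or class='...' and captures the attribute value.
    private static final Pattern CLASS_ATTR =
            Pattern.compile("class\\s*=\\s*[\"']([^\"']*)[\"']");

    // Returns true if any class attribute in the markup lists className.
    static boolean hasClass(String html, String className) {
        Matcher m = CLASS_ATTR.matcher(html);
        while (m.find()) {
            if (Arrays.asList(m.group(1).trim().split("\\s+")).contains(className)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up stand-in for what a server might send before any script runs:
        // the comment container is present but empty.
        String staticHtml = "<html><body>"
                + "<div class=\"story\">Article text</div>"
                + "<div id=\"disqus_thread\"></div>"
                + "<script src=\"embed.js\"></script>"
                + "</body></html>";

        System.out.println(hasClass(staticHtml, "story"));            // true
        System.out.println(hasClass(staticHtml, "dsq-comment-body")); // false
    }
}
```

If the same check on the HTML your crawler actually downloads also comes back false for the comment classes, the comments are being added later by a script.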

To get hold of the comments you have 3 options:

1) Use a web crawler that can execute JavaScript. Selenium WebDriver (http://www.seleniumhq.org/projects/webdriver/) and PhantomJS (http://phantomjs.org/) are popular options here. The former works by driving a real browser (e.g. Mozilla Firefox), which it opens programmatically. The latter is headless: it does not open a browser window and executes the JavaScript with its own WebKit engine instead.

2) Intercept the network traffic when opening the site (here you can probably use your browser's built-in network tab) and find the request that fetches the comments. Make this request yourself and extract the relevant data to your application. Bear in mind that this will not work if the server serving the comments requires some kind of authentication.

3) If the comments are served by a specialized provider with an openly accessible API, then it might be possible to extract them through this API. The site you linked to uses Disqus to handle the comment section so it might be possible to hook into their API and fetch them this way.
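
Options 2 and 3 both come down to making the comment request yourself. Here is a minimal sketch using only `java.net`, which is also available on Android. The endpoint and the `api_key`, `forum` and `thread=link:...` parameters reflect my understanding of the Disqus public API and should be checked against their documentation. `YOUR_API_KEY` and `YOUR_FORUM_SHORTNAME` are placeholders, and the class and method names are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class CommentRequest {
    // URL-encodes one query parameter value as UTF-8.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    // Builds the request URL. Endpoint and parameter names are assumptions
    // about the Disqus API, not something confirmed in this thread.
    static String buildListPostsUrl(String apiKey, String forum, String pageUrl) {
        return "https://disqus.com/api/3.0/threads/listPosts.json"
                + "?api_key=" + enc(apiKey)
                + "&forum=" + enc(forum)
                + "&thread=" + enc("link:" + pageUrl);
    }

    // Performs a plain GET and returns the response body as a string.
    static String fetch(String requestUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(requestUrl).openConnection();
        conn.setConnectTimeout(20 * 1000);
        conn.setReadTimeout(20 * 1000);
        conn.setRequestProperty("User-Agent", "Chrome");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String pageUrl = "http://blogs.tribune.com.pk/story/39090/i-hate-materialistic-people-beta-but-i-love-my-designer-clothes/";
        String requestUrl = buildListPostsUrl("YOUR_API_KEY", "YOUR_FORUM_SHORTNAME", pageUrl);
        System.out.println(requestUrl);
        // With a real key and forum name, fetch(requestUrl) should return the
        // response body, which you would then parse for the comment text.
    }
}
```

For option 2 you would skip `buildListPostsUrl` and pass whatever URL you see in the browser's network tab straight to `fetch`. On Android, remember to run the request off the main thread.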

Soggiorno
  • I am building a web crawler that will run inside an Android application. If I go with option **1** that you suggested, what about the browser being opened? How would that work in an Android application? – waqas Aug 23 '16 at 03:06
  • The Android driver for Selenium is deprecated. For Android you should use Appium (http://appium.io/) or Selendroid (http://selendroid.io/). See this related question: http://stackoverflow.com/questions/18727677/is-selenium-testing-worthwhile-on-mobile-devices. They work by opening a WebView, which supports JavaScript execution (although it's not enabled in the default configuration). – Soggiorno Aug 23 '16 at 12:19
  • I want to open neither the browser nor a WebView. Is there any method in `Android` that extracts the `JavaScript`-generated data without a browser or WebView? – waqas Aug 23 '16 at 13:27
  • I'm not aware of any such method. Does your app connect to a server? If so, you could have a headless browser (PhantomJS) running on that server that extracts the JavaScript-generated data and sends it to the client (app). – Soggiorno Aug 23 '16 at 15:31