Hi, I am new to jsoup and am trying to scrape data from the following link:

https://www.zomato.com/ahmedabad/mcdonalds-navrangpura

but I'm not able to get the data for elements with the following class: rev-text

This is my code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws IOException {
        Document doc;
        doc = Jsoup.connect("https://www.zomato.com/ahmedabad/mcdonalds-navrangpura").userAgent("Chrome/41.0.2228.0").get();

        // get page title
        String title = doc.title();
        System.out.println("title : " + title);

        // get all elements with the rev-text class
        Elements links = doc.getElementsByClass("rev-text");

        /* Elements links = doc.getAllElements(); */
        for (Element link : links) {
            // print the element's HTML, then its text content
            System.out.println("\nlink : " + link);
            System.out.println("text : " + link.text());
        }
    }
}

Please guide me on how to do this.

pelumi
  • Are you trying to get the reviews on this page? – vab Oct 09 '15 at 10:08
  • This is because the website loads the reviews with JavaScript, and jsoup does not support JavaScript. You can test this by disabling JS in your browser and loading that page: the reviews won't appear. The way to work around this is to load the data manually from the URL the page itself calls, which is `https://www.zomato.com/php/filter_reviews.php`. You will have to save the cookie you receive when you fetch the HTML of the first page and send it with your request to this URL to get the data for the comments. – Jonas Czech Oct 09 '15 at 12:47

1 Answer

Problem Background

The rev-text element is not part of the initial page source; it is loaded dynamically using JavaScript. Since Jsoup is not a browser simulator, it doesn't execute the scripts on the page; it just gives you the source as the server sent it.

A simple way to test the source retrieved is to print it out; you will see that the rev-text class is not present at all.

System.out.println(doc.html()); //print out page source
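
If the printed source is too long to scan by eye, the same check can be done programmatically. With Jsoup, `doc.getElementsByClass("rev-text").isEmpty()` answers the question directly. The block below is only a rough, library-free sketch of that idea for a raw HTML string; the class name `SourceCheck`, the helper `hasClass`, and both sample HTML snippets (including the `mbot0` class) are made up for illustration, and the regex is a crude sanity check, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SourceCheck {
    // Matches class="..." or class='...' attributes and captures the attribute value.
    private static final Pattern CLASS_ATTR =
            Pattern.compile("class\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns true if any class attribute in the HTML contains className as a whole token.
    static boolean hasClass(String html, String className) {
        Matcher m = CLASS_ATTR.matcher(html);
        while (m.find()) {
            for (String token : m.group(2).trim().split("\\s+")) {
                if (token.equals(className)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the source without and with JavaScript executed.
        String staticHtml = "<div class=\"res-info\">McDonald's</div>";
        String renderedHtml = "<div class=\"rev-text mbot0\">Nice fries</div>";
        System.out.println(hasClass(staticHtml, "rev-text"));   // prints false
        System.out.println(hasClass(renderedHtml, "rev-text")); // prints true
    }
}
```

In practice you would pass `doc.html()` as the first argument; a `false` result for the page fetched by Jsoup confirms that the reviews are not in the served source.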

Proposed Solution

Generally, to scrape content from JavaScript-heavy web pages, it is useful to use a tool that can simulate a browser by executing the scripts on the page. A common library that does this is Selenium. You can use the PhantomJS driver (worth reading up on) in Selenium to fetch the page, then pass the page source to Jsoup and extract the rev-text elements.

Here is sample code that uses Selenium to fetch the rendered page so that Jsoup can extract the fields you need:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class Test {

    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new PhantomJSDriver(new DesiredCapabilities());
        driver.get("https://www.zomato.com/ahmedabad/mcdonalds-navrangpura"); // retrieve page with selenium
        Thread.sleep(3 * 1000); // bad idea; wait for a specific element (e.g. the rev-text class) instead of sleeping [1]
        Document doc = Jsoup.parse(driver.getPageSource()); // hand the rendered source to Jsoup
        driver.quit(); // quit webdriver

        // get page title
        String title = doc.title();
        System.out.println("title : " + title);

        // get all elements with the rev-text class
        Elements reviews = doc.getElementsByClass("rev-text");
        for (Element review : reviews) {
            // print the element's HTML, then its text content
            System.out.println("\nlink : " + review);
            System.out.println("text : " + review.text());
        }
    }
}

You will need to add the Selenium libraries to your classpath. I'm using Maven, so all I added was:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>2.45.0</version>
    </dependency>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-remote-driver</artifactId>
        <version>2.45.0</version>
    </dependency>
    <dependency>
        <groupId>com.codeborne</groupId>
        <artifactId>phantomjsdriver</artifactId>
        <version>1.2.1</version>
    </dependency>

This works fine for me and extracts the reviews on the page.

  1. Wait for specific element in selenium
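
To illustrate what waiting for a specific element means, compared with a fixed sleep, here is a minimal sketch of a polling wait in plain Java. The names `WaitFor` and `waitUntil` are my own and not part of Selenium; Selenium ships its own `WebDriverWait` and `ExpectedConditions` classes for this, which are the better choice in real code. The condition in `main` is only a stand-in that becomes true after roughly 200 ms.

```java
import java.util.function.BooleanSupplier;

public class WaitFor {
    /**
     * Polls the condition until it returns true or timeoutMillis has elapsed.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition becoming true
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition (an assumption for the demo): true after about 200 ms.
        boolean found = waitUntil(() -> System.currentTimeMillis() - start >= 200, 3000, 50);
        System.out.println("condition met: " + found);
    }
}
```

With a driver in hand, the condition could be something like `() -> !driver.findElements(By.className("rev-text")).isEmpty()`, so the code continues as soon as the reviews appear instead of always pausing for the full three seconds.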
pelumi