Problem Background
The rev-text element is not part of the page's initial HTML source; it is loaded dynamically by JavaScript. Jsoup is an HTML parser, not a browser simulator, so it does not execute the scripts on the page; it only gives you the source as served.
A simple way to check this is to print the retrieved source; you will see that the rev-text class is not present at all.
System.out.println(doc.html()); //print out page source
Proposed Solution
To scrape content from JavaScript-heavy pages, it is usually useful to use a tool that can simulate a browser by executing the scripts on the page. A common library for this is Selenium. You can use the PhantomJS driver in Selenium (it is worth reading up on), fetch the page, pass the rendered page source to Jsoup, and extract the rev-text elements.
Here is sample code that uses Selenium to extract the fields you need:
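import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class ReviewScraper { // placeholder class name; the original snippet omitted the class declaration
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new PhantomJSDriver(new DesiredCapabilities());
        driver.get("https://www.zomato.com/ahmedabad/mcdonalds-navrangpura"); // retrieve page with selenium
        Thread.sleep(3 * 1000); // bad idea; wait for a specific element (e.g. the rev-text class) instead of using sleep [1]
        Document doc = Jsoup.parse(driver.getPageSource()); // hand the rendered source to Jsoup
        driver.quit(); // quit webdriver
        // get page title
        String title = doc.title();
        System.out.println("title : " + title);
        // get all elements with the rev-text class
        Elements reviews = doc.getElementsByClass("rev-text");
        for (Element review : reviews) {
            // print the element's HTML, then its text content
            System.out.println("\nelement : " + review);
            System.out.println("text : " + review.text());
        }
    }
}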
You will need to add the Selenium libraries (and Jsoup) to your classpath. I'm using Maven, so all I added was:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.45.0</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-remote-driver</artifactId>
    <version>2.45.0</version>
</dependency>
<dependency>
    <groupId>com.codeborne</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.1</version>
</dependency>
This works fine for me and extracts the reviews on the page.
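As the comment in the sample says, a fixed sleep is fragile: it either waits too long or not long enough. The idea behind waiting for a specific element is to poll a condition until it holds or a timeout expires. Below is a minimal sketch of that polling idea using only the standard library; the class PollingWait, the method pollUntil, and the timing values are hypothetical names and numbers chosen for illustration, not part of Selenium.

```java
import java.util.function.BooleanSupplier;

public class PollingWait {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met in time, false on timeout
     * or if the waiting thread was interrupted.
     * (Hypothetical helper for illustration; not a Selenium API.)
     */
    static boolean pollUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // condition met, stop waiting immediately
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition becoming true
            }
            try {
                Thread.sleep(intervalMillis); // short pause between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 300 ms after start.
        boolean ready = pollUntil(() -> System.currentTimeMillis() - start >= 300, 3000, 50);
        System.out.println("condition met: " + ready);
    }
}
```

With the driver from the sample, the condition could be something like () -> !driver.findElements(By.className("rev-text")).isEmpty(), called before getPageSource(). Selenium also ships its own explicit-wait support (WebDriverWait together with ExpectedConditions), which is probably the better choice in real code; the helper above only shows the mechanism.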
[1] Wait for specific element in Selenium