
I'm trying to parse the web page paadalvarigal.com to get the URLs of all the results on the page (highlighted in green). But when I parse the page with Jsoup and print the doc object, I don't get all of the divs. The URLs and titles in the div with id "hits" are also replaced with "{{{URL}}}" and "{{{Title}}}" in the doc object printed to the console (IDE screenshot). Also, of the six divs with class name hit in the actual page (Chrome Dev Console screenshot), I get only one div with class hit in the parsed page.

I have also tried setting maxBodySize() to 0 to get the entire page, but I still have the same problem. Please guide me on what's going wrong.

package com.balaji.parse;
import org.jsoup.Jsoup;
import org.jsoup.Connection;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseHTML {
    private static final String URL = "http://www.paadalvarigal.com/search/?q=naanum%20rowdythan";
    public static void main(String args[]) {
        //parseFromString();
        parseFromHTML();
    }

    private static void parseFromString() {
        String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";

        Document doc = Jsoup.parse(html);
        System.out.println(doc.head());
        System.out.println(doc.title());
        System.out.println(doc.body());
        //To Parse only body tag and elements - adds HTML and Body tags.
        System.out.println("Parsing only Body");
        Document doc2 = Jsoup.parseBodyFragment(html);
        System.out.println(doc2);
    }

    private static void parseFromHTML() {
        try {
            Connection con = Jsoup.connect(URL);
            con.timeout(5000);
            con.header("Accept-Encoding", "gzip, deflate");
            con.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0");
            con.maxBodySize(0);

            Document doc = con.get();
            System.out.println(doc.head());
            System.out.println(doc.title());
            System.out.println(doc);

        } catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}

P.S.: I'm new to Jsoup and am learning the library for personal projects.

balajivijayan

2 Answers


You don't need to use jsoup to get the search results for this website.

If you look at the Network tab of the Chrome Developer Tools, you can see that loading the page sends a POST request to an endpoint with this specific JSON content:

{"requests":[{"indexName":"song","params":"query=naanum%20rowdythan&hitsPerPage=7&maxValuesPerFacet=7&page=0&facets=%5B%22singers%22%2C%22Lyrics%20By%22%2C%22Music%20By%22%2C%22Singers%22%5D&tagFilters="}]}

You can see that the search term from the page URL (q=naanum%20rowdythan) appears in the JSON as query=naanum%20rowdythan.

This is the beginning of the response you will get:

{
  "results": [
    {
      "hits": [
        {
          "Title": "Varavaa Varavaa",
          "Movie": "Naanum Rowdydhaan",
          "Lyrics By": [
            "Vignesh Shivan"
          ],
          "Music By": [
            "Anirudh"
          ],
          "Singers": [
            "Anirudh Ravichander",
            "Vignesh Shivan"
          ],
          "Img": "http://www.paadalvarigal.com/wp-content/uploads//NaanumRowdydhaan-70x53.jpg",
          "URL": "http://www.paadalvarigal.com/3598/varavaa-varavaa-naanum-rowdydhaan-song-lyrics.html",
          "objectID": "3598",
          "_highlightResult": {
            "Title": {
              "value": "Varavaa Varavaa",
              "matchLevel": "none",
              "matchedWords": []
            },
            "Movie": {
              "value": "<em>Naanum</em> <em>Rowdydhaa</em>n",
              "matchLevel": "full",
              "matchedWords": [
                "naanum",
                "rowdythan"
              ]
            }
          }
        },

Here's a screenshot of Chrome's developer tools: (screenshot)

So all you need to do is:

  1. Send a POST request to the endpoint with the body adjusted for your query (see "Sending HTTP POST Request In Java") and read the response, which is JSON.
  2. Parse the JSON result to get what you need (see "How to parse JSON in Java").
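
The two steps above could be sketched roughly as follows. This is a minimal sketch, not a tested client for this site: the endpoint URL is not shown in this answer, so it is passed in as an argument and has to be copied from the Network tab, and the endpoint may also expect extra headers or parameters seen there. The class and method names (SearchSketch, buildRequestBody, post, extractUrls) are made up for illustration, and the regex extraction is a dependency-free shortcut; a real JSON library would be more robust. It assumes Java 11+ for java.net.http.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchSketch {

    // Builds a body shaped like the one captured in the Network tab,
    // swapping in the caller's search term.
    public static String buildRequestBody(String query) {
        // URLEncoder encodes spaces as '+', the captured request uses %20.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8).replace("+", "%20");
        String params = "query=" + encoded
                + "&hitsPerPage=7&maxValuesPerFacet=7&page=0"
                + "&facets=%5B%22singers%22%2C%22Lyrics%20By%22%2C%22Music%20By%22%2C%22Singers%22%5D"
                + "&tagFilters=";
        return "{\"requests\":[{\"indexName\":\"song\",\"params\":\"" + params + "\"}]}";
    }

    // Sends the body as a POST and returns the raw response text.
    // The Content-Type header is an assumption; copy the real headers from the Network tab.
    public static String post(String endpoint, String body) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Collects the value of every "URL" field in the JSON text.
    // Regex shortcut: it does not handle escaped quotes or escaped slashes.
    public static List<String> extractUrls(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern.compile("\"URL\"\\s*:\\s*\"([^\"]*)\"").matcher(json);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String body = buildRequestBody("naanum rowdythan");
        if (args.length == 0) {
            // No endpoint given: just show the body that would be sent.
            System.out.println(body);
            return;
        }
        for (String url : extractUrls(post(args[0], body))) {
            System.out.println(url);
        }
    }
}
```

Run it with the endpoint copied from the Network tab as the first argument; without an argument it only prints the request body, which is handy for comparing against the captured one.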
Community

Take a look at the source of the page, not through Firebug or Developer Tools (which show the DOM after scripts have run), but with the good old right-click -> View Source.

The source should match your Jsoup output. There seems to be a script (loaded by the page) that replaces the {{URL}}, {{Title}}, ... templates with real data.

Jsoup will not do this for you: it does not execute any client-side scripts. You will have to find another way to get to the data. With a little bit of digging, you can probably find something in the loaded scripts.

nyname00