
I have a program that is supposed to scrape content from an HTML page.

import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s%n", country, university);
        }
    }
}

However, an HTTP header seems to be blocking access to the content. So I have written the following program to get the headers of the site.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.4icu.org/reviews/index2.htm");
    URLConnection connection = url.openConnection();

    // Each header name maps to a list of values; the HTTP status line
    // is stored under the null key.
    Map<String, List<String>> responseMap = connection.getHeaderFields();
    for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
      System.out.println(entry.getKey() + " = " + String.join(", ", entry.getValue()));
    }
  }
}
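(For reference, the same header iteration can be exercised without any network access by running it over a hard-coded sample map. The class and method names below, `HeaderPrinter` and `format`, are illustrative only — this is just a stand-alone sketch of the generics-based loop above:)

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeaderPrinter {
    // Formats each header entry as "name = value1, value2", one per line,
    // mirroring what Get_Header prints for a real response.
    static String format(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            sb.append(entry.getKey()).append(" = ")
              .append(String.join(", ", entry.getValue())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, List<String>> sample = new LinkedHashMap<>();
        sample.put("Server", List.of("cloudflare-nginx"));
        sample.put("Cache-Control", List.of("max-age=10"));
        System.out.print(format(sample));
        // prints:
        // Server = cloudflare-nginx
        // Cache-Control = max-age=10
    }
}
```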

It returns the following result:

X-Frame-Options = SAMEORIGIN
Transfer-Encoding = chunked
null = HTTP/1.1 403 Forbidden
CF-RAY = 2ca61c7a769b1980-HKG
Server = cloudflare-nginx
Cache-Control = max-age=10
Connection = keep-alive
Set-Cookie = __cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly
Expires = Sat, 30 Jul 2016 04:36:53 GMT
Date = Sat, 30 Jul 2016 04:36:43 GMT
Content-Type = text/html; charset=UTF-8

Though I can get the headers, how should I combine the two programs into one complete program?

Great thanks in advance.

Kennedy Kan

2 Answers


You can use Jsoup's Connection.Response class to fetch the page: it lets you display the response headers, and you can then parse it into a Document to extract the text you need:

Connection.Response response = Jsoup.connect("http://www.4icu.org/reviews/index2.htm")
            .userAgent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)")
            .method(Connection.Method.GET)
            .followRedirects(false)
            .execute();

Document doc = response.parse();
Elements cells = doc.select("td.i");
Iterator<Element> iterator = cells.iterator();

while (iterator.hasNext()) {
    Element cell = iterator.next();
    String university = cell.select("a").text();
    String country = cell.nextElementSibling().select("img").attr("alt");
    System.out.printf("country : %s, university : %s %n", country, university);
}
System.out.println(response.headers());
TDG

The "User-Agent" property which you set on the URLConnection is lost: converting the URL back to a String and passing it to Jsoup.connect() opens a fresh connection without that header.

Setting the user-agent on the JSoup connection seems to work:

import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        String url = "http://www.4icu.org/reviews/index2.htm";
        Document doc = Jsoup.connect(url).userAgent("Mozilla").get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s%n", country, university);
        }
    }
}
ebo
  • That works perfectly for most of the extraction. However, some universities with special characters in their names are not extracted correctly. For example, Abant Izzet Baysal Üniversitesi, which contains the special character Ü, comes out garbled. How should I convert it back to the original character during extraction? – Kennedy Kan Jul 30 '16 at 15:22
  • Glad to hear that the answer is useful. The encoding issue seems to be a different question. This [question](http://stackoverflow.com/questions/994331/java-how-to-unescape-html-character-entities-in-java/37277534) might be useful. Note that according to [this](http://stackoverflow.com/a/37277534/13226) answer this situation is already handled by JSoup. – ebo Aug 01 '16 at 17:48
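(If the garbling in the comment above turns out to be HTML entity escaping — e.g. `&Uuml;` appearing instead of Ü — the fix can be sketched without any library. This is a minimal, stdlib-only illustration covering just a couple of named entities; as the linked answer notes, Jsoup normally unescapes named entities automatically while parsing, so this is only to show the idea:)

```java
public class Entities {
    // Tiny illustration: replace a handful of named HTML entities by hand.
    static String unescape(String s) {
        return s.replace("&Uuml;", "Ü")
                .replace("&uuml;", "ü")
                .replace("&amp;", "&"); // &amp; last, so "&amp;Uuml;" is not double-unescaped
    }

    public static void main(String[] args) {
        System.out.println(unescape("Abant Izzet Baysal &Uuml;niversitesi"));
        // prints: Abant Izzet Baysal Üniversitesi
    }
}
```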