Why can my Java code get the content of only some URLs (web pages)?

Question

I am trying to fetch the content of some URLs with my Java code. The code returns the content for some URLs, for example "http://www.nytimes.com/video/world/europe/100000004503705/memorials-for-victims-of-istanbul-attack.html", but returns nothing for others, for example "http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0". When I open the second URL manually in a browser, I see the content, and even when I view the source I don't notice any particular difference between the structure of the two pages. Still, my code gets nothing for this URL.

Is this related to a permission problem, the structure of the web page, or my Java code?

Here is my code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class TestJsoup {
    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0"));
    }

    public static String getUrlParagraphs(String url) {
        try {
            URL urlContent = new URL(url);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
            String line;
            StringBuffer html = new StringBuffer();
            while ((line = in.readLine()) != null) {
                html.append(line);
                System.out.println("Test");
            }
            in.close();
            System.out.println(html.toString());
            return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

Andy Turner · Answer 1 · 2016-08-04T09:15:31.223

It's because the second one redirects, and you don't attempt to follow the redirection.

Try accessing it with curl -v:

$ curl -v 'http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0'
* Hostname was NOT found in DNS cache
*   Trying 170.149.161.130...
* Connected to www.nytimes.com (170.149.161.130) port 80 (#0)
> GET /2016/07/24/travel/mozart-vienna.html?_r=0 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: www.nytimes.com
> Accept: */*
> 
< HTTP/1.1 303 See Other
* Server Varnish is not blacklisted
< Server: Varnish
< Location: http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2016%2F07%2F24%2Ftravel%2Fmozart-vienna.html%3F_r%3D1
< Accept-Ranges: bytes
< Date: Thu, 04 Aug 2016 08:45:53 GMT
< Age: 0
< X-API-Version: 5-0
< X-PageType: article
< Connection: close
< X-Frame-Options: DENY
< Set-Cookie: RMID=007f0101714857a300c1000d;Path=/; Domain=.nytimes.com;Expires=Fri, 04 Aug 2017 08:45:53 UTC
< 
* Closing connection 0

You can see that the response has no body, a 3XX status code (303 See Other), and a Location: header.
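The same check can be made from Java. The following is a minimal sketch, not the asker's code: the class name RedirectCheck and its helper methods are hypothetical names chosen for illustration. Note that HttpURLConnection follows same-protocol redirects on its own by default, so the sketch turns that off with setInstanceFollowRedirects(false) to make the 3XX response and its Location header visible.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class, named for illustration only.
class RedirectCheck {

    // True for any status in the 3xx class, which is how a redirect is signalled.
    static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    // Resolves a Location value (absolute or relative) against the requested URL.
    static String resolveLocation(String requestedUrl, String location) {
        return URI.create(requestedUrl).resolve(location).toString();
    }

    // Sends one GET without following redirects and summarises the response.
    static String describe(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Disable the default auto-follow so the 3xx response itself is returned.
        conn.setInstanceFollowRedirects(false);
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (isRedirect(status) && location != null) {
                return status + " -> " + resolveLocation(url, location);
            }
            return String.valueOf(status);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0
                ? args[0]
                : "http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0";
        System.out.println(describe(url));
    }
}
```

Run against the second URL from the question, this would be expected to print the 3XX status followed by the redirect target, although the site's behaviour may have changed since the curl trace above was captured.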

Thank you, Andy! You are right! It's a redirected URL, and when I open the redirect target in my browser, I have to enter my username and password before I can see the page. I know how to get the redirect information in my Java code, but I don't know how to get past the username/password step and fetch the content. Do you have any idea about that? Can I simply add my username and password to the redirected link? - Simone, Aug 04 '16 at 10:11

MaJiD · Answer 2 · 2016-08-04T10:09:18.080

Hello, the problem is in your URL. I tried your code on my machine and it also returned null, but I read the Oracle documentation and found that the problem is the host, so if you change the URL (for example, to the post link used in the code below) it works fine. My code is here:

package sd.nctr.majid;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Program {

    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://stackoverflow.com/questions/4328711/read-url-to-string-in-few-lines-of-java-code"));

    }

    public static String getUrlParagraphs (String url) {
        try {
          URL urlContent = new URL(url);
          BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
          String line;
          StringBuffer html = new StringBuffer();
          while ((line = in.readLine()) != null) {
            html.append(line);
            System.out.println("Test");
          }
          in.close();
          System.out.println(html.toString());
          return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
