0

I have a long list of links (absolute URLs) stored in a text file. I need to find out which links are dead (the web page no longer exists at the given address). Example:

Android                   http://www.android.com/
stackoverflow             https://stackoverflow.com/
AIMS Desktop              https://desktop.aims.ac.za/
google                    http://www.google.com/

blahblah                  http://www.ffgfgfgkzu.com

I do not care whether there is a redirect, for example from http to https (if I type http://www.google.com/ in my browser, it is redirected to https://www.google.com/), or to any other page with a totally different URL. I am only interested in finding dead links, like the last entry above, where my browser also shows an error page (German text for "page not found").

I have looked into Selenium and some other web scraping tutorials. I don't want to scrape any content. I only need to remove the dead links from my list.

wannaBeDev
  • Does doing [this](https://docs.oracle.com/javase/tutorial/networking/urls/readingURL.html) (from oracle tutorials) and catching the Exception work for you (if exception, bad url… ignoring auth issues, etc. Assuming none of the URLs would have those issues). – BeUndead Jun 05 '21 at 00:23
  • https://stackoverflow.com/questions/1378199/how-to-check-if-a-url-exists-or-returns-404-with-java – RealHowTo Jun 05 '21 at 00:31
  • @BeUndead I will try it. Thanks for your suggestion. – wannaBeDev Jun 05 '21 at 00:37
  • @RealHowTo Thanks for your reply. Those answers are really old. I thought there might be something easy (a one- or two-liner) where I wouldn't need to care about headers, status codes and so on... But if that is the way, I will try it. – wannaBeDev Jun 05 '21 at 00:40
  • There's a perfectly good two-line solution there. – user207421 Jun 05 '21 at 06:52

2 Answers

2

You can send a HEAD request to the URL and check the response code. If the response code is 404, the URL does not exist. A HEAD request is usually faster than GET because the server does not return a response body. This is a standard way to check whether a URL exists. The code snippet below uses Apache HttpClient version 4.5.5 to do this:

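import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.TrustAllStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContextBuilder;

/**
 * Sends a HEAD request and reports whether the URL still points to a page.
 *
 * @param url absolute URL to check
 * @return false if the server answers 404 or the request fails, true otherwise
 */
public static boolean isReachable(String url) {
    // Accept any certificate and host name: this only tests existence, not trust.
    try (CloseableHttpClient httpClient = HttpClients.custom()
        .setSSLContext(new SSLContextBuilder().loadTrustMaterial(null, TrustAllStrategy.INSTANCE).build())
        .setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE)
        .build())
    {
        HttpHead request = new HttpHead(url);
        // Close the response too, so the underlying connection is released.
        try (CloseableHttpResponse response = httpClient.execute(request)) {
            if (response.getStatusLine().getStatusCode() == 404) {
                System.out.println("URL " + url + " not found");
                return false;
            }
            return true;
        }
    } catch (Exception e) {
        // Unknown host, refused connection, malformed URL, etc.
        e.printStackTrace();
        return false;
    }
}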
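
To actually prune the list from the question, a check like this can be wrapped in a small filter. The following is only a sketch under a few assumptions: the class `DeadLinkFilter` and its method names are hypothetical, the URL is taken to be the last whitespace-separated token on each line, and the reachability check is passed in as a `Predicate<String>` so that any HTTP check can be plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical helper: drops list entries whose URL fails a caller-supplied check. */
final class DeadLinkFilter {

    /** Returns the last whitespace-separated token if it looks like an http(s) URL. */
    static Optional<String> extractUrl(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        String[] tokens = trimmed.split("\\s+");
        String last = tokens[tokens.length - 1];
        boolean looksLikeUrl = last.startsWith("http://") || last.startsWith("https://");
        return looksLikeUrl ? Optional.of(last) : Optional.empty();
    }

    /** Keeps lines without a URL untouched and drops lines whose URL is reported dead. */
    static List<String> keepAlive(List<String> lines, Predicate<String> isAlive) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            Optional<String> url = extractUrl(line);
            if (!url.isPresent() || isAlive.test(url.get())) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "google                    http://www.google.com/",
                "blahblah                  http://www.ffgfgfgkzu.com");
        // Stand-in check for the demo only; replace it with a real HTTP check.
        Predicate<String> stubCheck = url -> !url.contains("ffgfgfgkzu");
        keepAlive(lines, stubCheck).forEach(System.out::println);
    }
}
```

With the method above, the call would look like `DeadLinkFilter.keepAlive(lines, url -> isReachable(url))`, where `lines` could come from `Files.readAllLines(path)`.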
pcsutar
0

InetAddress has a method to check for availability, InetAddress.isReachable(int timeout), where the timeout is given in milliseconds:

boolean reachable = InetAddress.getByName(host).isReachable(timeout);

Or, if you prefer to catch and check exceptions, you can also use sockets:

public static boolean pingHost(String host, int port, int timeout) {
    try (Socket socket = new Socket()) {
        socket.connect(new InetSocketAddress(host, port), timeout);
        return true;
    } catch (IOException e) {
        return false;
    }
}
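
`pingHost` expects a host name and a port rather than a full URL. A minimal sketch of deriving both from an absolute http(s) URL is shown here; the class `HostPort` and the method `fromUrl` are hypothetical names, and the fallback ports are the usual defaults (80 for http, 443 for https).

```java
import java.net.InetSocketAddress;
import java.net.URI;

/** Hypothetical helper: splits an absolute http(s) URL into host and port. */
final class HostPort {

    /** Returns an unresolved address (no DNS lookup happens here). */
    static InetSocketAddress fromUrl(String url) {
        URI uri = URI.create(url.trim());
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        int port = uri.getPort(); // -1 when the URL has no explicit port
        if (port == -1) {
            port = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
        }
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

The result can be passed on as `pingHost(address.getHostString(), address.getPort(), timeout)`. Note that a successful connection only shows that the server accepts connections; it does not tell you whether the specific page still exists.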

Or HttpURLConnection:

public static boolean isInternetReachable(String urlStr)
{
    try {
        URL url = new URL(urlStr);

        HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection();

        urlConnection.setInstanceFollowRedirects(true);

        // getContent() forces the request; an unknown host or an HTTP error
        // status such as 404 surfaces here as an IOException.
        urlConnection.getContent();

    } catch (Exception e) {
        e.printStackTrace();
        return false;
    }

    return true;
}
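
The method above downloads the response body just to see whether the request succeeds. A lighter variant sends a HEAD request and looks only at the status code. This is a sketch, not a definitive implementation: the class `HeadCheck`, its method names, and the choice to treat 404, 410 and "no response at all" as dead are assumptions for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper: classifies a URL by the status of a HEAD request. */
final class HeadCheck {

    /** Returns the HTTP status of a HEAD request, or -1 if no HTTP response was obtained. */
    static int headStatus(String urlStr, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) URI.create(urlStr).toURL().openConnection();
            connection.setRequestMethod("HEAD");
            // Same-protocol redirects are followed; an http -> https redirect is
            // not followed by HttpURLConnection and is reported as a 3xx status.
            connection.setInstanceFollowRedirects(true);
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            return connection.getResponseCode();
        } catch (IOException | RuntimeException e) {
            // Malformed URL, non-HTTP scheme, unknown host, timeout, refused connection.
            return -1;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    /** Assumption: only "not found", "gone" and "no response" count as dead; 3xx is alive. */
    static boolean isDead(int status) {
        return status == -1 || status == 404 || status == 410;
    }
}
```

A call such as `HeadCheck.isDead(HeadCheck.headStatus(url, 5000))` then gives the verdict for one URL. Some servers reject HEAD (for example with 405), which this sketch counts as alive.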
deepakchethan
  • I just have the links as strings, like given above. What should I put for the port? Is "host" the same as the whole URL or only a part of it? – wannaBeDev Jun 05 '21 at 00:44
  • The host is part of it, like "www.android.com", and the port can default to 80 or 443. If you want a generic solution, I suggest using HttpURLConnection. – deepakchethan Jun 05 '21 at 01:00
  • `InetAddress` is not sufficient to find specific URLs that are dead links. `HttpURLConnection` is, but it should be `setInstanceFollowRedirects(false)` according to the question. And there is no need to fetch the entire content. Just use the `HEAD` method and get the response code. – user207421 Jun 05 '21 at 01:25
  • @user207421 The question states that they don't care about redirects but do care about dead links. If the URL redirects to a dead link, it should return false. That is why I am using `setInstanceFollowRedirects(true)`. – deepakchethan Jun 05 '21 at 01:26
  • The question states that he wants to know about dead links. A successful redirect from a dead link conceals that and gives you something else instead, such as the frequently seen 'this domain is for sale'. – user207421 Jun 05 '21 at 01:30
  • @user207421 In the scenario that you mentioned, you won't see the browser's "page not found" error page. Instead you see google.com, and that is expected to be returned as true. If you re-read the question, he is okay with the site redirecting to any URL. That includes a "this domain is for sale" URL. – deepakchethan Jun 05 '21 at 01:32