Using ScrapingUtils, I am parsing some URLs. To do this I am using the following code:

String link = "Here the link";                                           
Document doc = ScrapingUtils.visit(link, false);

if (doc != null) {
   //code
} else {
   //code
}

The problem is that sometimes it is not able to receive the HTML, so it cannot extract the data. I have tried using try..catch so that, if there is a read timeout error, I can give a specific value to the variables to record that an error occurred.

I have tried with this:

String link = "Here the link";                                           
Document doc = ScrapingUtils.visit(link, false);

try {
    if (doc != null) {
       //code
    } else {
       //code
    }
} catch (TimeoutException exception) {
    throw new TimeoutException("Timeout exceeded: " + timeout + unit);
}

But I receive a compile error on the catch (TimeoutException exception) clause:

TimeoutException exception is never thrown in body of corresponding try statement

I understand that Java rejects this catch because nothing in the try block can ever throw that exception.
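
The rule behind that message can be illustrated with a small sketch. It assumes java.util.concurrent.TimeoutException, which is a checked exception, and a hypothetical fetch helper that is not part of ScrapingUtils. A catch clause for a checked exception only compiles when something in the try body is declared to throw it:

```java
import java.util.concurrent.TimeoutException;

public class CheckedCatchDemo {

    // Hypothetical helper: it declares TimeoutException, so callers may catch it.
    static String fetch(boolean simulateTimeout) throws TimeoutException {
        if (simulateTimeout) {
            throw new TimeoutException("Timeout exceeded");
        }
        return "<html></html>";
    }

    static String describe(boolean simulateTimeout) {
        try {
            return fetch(simulateTimeout);
        } catch (TimeoutException e) {
            // Legal only because fetch(...) declares this checked exception.
            // If the try body called nothing that declares TimeoutException,
            // javac would report that it is never thrown in the try statement.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(false));
        System.out.println(describe(true));
    }
}
```

In the code from the question, the try body only tests doc against null, so no statement in it can throw the exception.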

ScrapingUtils class:

public class ScrapingUtils {
    private static final Logger logger = LoggerFactory.getLogger(ScrapingUtils.class);

    public static Document visit(String urlStr, boolean useProxy) {
        Document doc = null;
        try {
            if (!useProxy) {
                logger.info("Downloading " + urlStr);
                doc = Jsoup.connect(urlStr).userAgent("Mozilla/5.0").maxBodySize(0).timeout(Config.CONNECTION_TIMEOUT).get();
            } else {
                logger.info("downloading " + urlStr);
                URL url = new URL(urlStr);

                String[] proxyStr = NetUtils.getProxy().split(":");
                Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyStr[0], Integer.parseInt(proxyStr[1])));
                HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
                conn.setConnectTimeout(Config.CONNECTION_TIMEOUT);
                conn.connect();

                BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                StringBuilder buffer = new StringBuilder();
                String str;

                while((str = br.readLine()) != null) {
                    buffer.append(str);
                }

                doc = Jsoup.parse(buffer.toString());
            }   
        } catch (IOException ex) {
            logger.error("Error downloading website " + urlStr + "\n" + ex.getMessage());
        }
        return doc;
    }

    public static Document visit(String urlStr) {
        return visit(urlStr, false);
    }
}
Adelin
JetLagFox

1 Answer

OK. You will never get a TimeoutException from this code, because nothing inside your try block is declared to throw it. What you can get is a SocketTimeoutException, thrown by these lines:

doc = Jsoup.connect(urlStr).userAgent("Mozilla/5.0").maxBodySize(0).timeout(Config.CONNECTION_TIMEOUT).get();

and

conn.connect();
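
Both calls report a timeout as java.net.SocketTimeoutException, which is a subclass of IOException. That is why the single catch (IOException ex) in visit absorbs it and the caller only sees null. Here is a minimal sketch of the effect, using a hypothetical download method that simulates the failures instead of touching the network:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class CatchOrderDemo {

    // Hypothetical stand-in for the real download; it only simulates failures.
    static String download(String mode) throws IOException {
        if (mode.equals("timeout")) {
            throw new SocketTimeoutException("Read timed out");
        }
        if (mode.equals("broken")) {
            throw new IOException("Connection reset");
        }
        return "<html></html>";
    }

    // Same shape as visit(): one broad catch, so a timeout looks like any other failure.
    static String broadCatch(String mode) {
        try {
            return download(mode);
        } catch (IOException ex) {
            return null;
        }
    }

    // The specific catch goes first; the compiler rejects the reverse order
    // because the SocketTimeoutException clause would be unreachable.
    static String specificCatch(String mode) {
        try {
            return download(mode);
        } catch (SocketTimeoutException ex) {
            return "TIMEOUT";
        } catch (IOException ex) {
            return "IO_ERROR";
        }
    }

    public static void main(String[] args) {
        System.out.println(broadCatch("timeout"));
        System.out.println(specificCatch("timeout"));
        System.out.println(specificCatch("broken"));
    }
}
```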

You can handle the exception there like this:

try {
    if (!useProxy) {
        // Direct download: Jsoup throws SocketTimeoutException when the timeout (in ms) expires
        Jsoup.connect("https://docs.oracle.com").userAgent("Mozilla/5.0").maxBodySize(0).timeout(1000).get();
    } else {
        URL url = new URL("https://docs.oracle.com");
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("", 11));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
        conn.setConnectTimeout(1000);
        conn.connect();

        BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        StringBuilder buffer = new StringBuilder();
        String str;

        while ((str = br.readLine()) != null) {
            buffer.append(str);
        }
    }
} catch (SocketTimeoutException a) {
    // Must come before IOException, because SocketTimeoutException is a subclass of it
    System.out.println("log");
} catch (IOException ex) {
    // Handle other I/O errors here
}

I modified the code so that it runs on my side and produces the SocketTimeoutException. If you want the caller to always deal with the SocketTimeoutException, rethrow it from the catch block:

catch (SocketTimeoutException a) {
    System.out.println("log");
    throw a; // rethrow the original so its message and stack trace are kept
}

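Because SocketTimeoutException is a checked exception, visit must then declare throws SocketTimeoutException, which forces every caller either to wrap the call in a try/catch or to add the exception to its own method signature: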

try {
    visit("test", true);
} catch (SocketTimeoutException e) {
    e.printStackTrace();
}
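
The question asks for a specific value that records when a timeout happened. One way to sketch that on the calling side is shown here, assuming visit has been changed to rethrow the timeout as described above. The Status enum, the Fetcher interface and the classify method are hypothetical names used only for this illustration, and the fetcher stands in for ScrapingUtils.visit:

```java
import java.net.SocketTimeoutException;

public class TimeoutStatusDemo {

    // Hypothetical result values the caller can store instead of guessing from null.
    enum Status { OK, NO_DOCUMENT, TIMEOUT }

    // Hypothetical stand-in for ScrapingUtils.visit once it declares the exception.
    interface Fetcher {
        String fetch(String url) throws SocketTimeoutException;
    }

    static Status classify(Fetcher fetcher, String url) {
        try {
            String doc = fetcher.fetch(url);
            // null still means "download failed for some other reason"
            return doc != null ? Status.OK : Status.NO_DOCUMENT;
        } catch (SocketTimeoutException e) {
            // The timeout is now visible to the caller as its own case
            return Status.TIMEOUT;
        }
    }

    public static void main(String[] args) {
        Fetcher ok = url -> "<html></html>";
        Fetcher failed = url -> null;
        Fetcher slow = url -> { throw new SocketTimeoutException("Read timed out"); };

        System.out.println(classify(ok, "test"));
        System.out.println(classify(failed, "test"));
        System.out.println(classify(slow, "test"));
    }
}
```

Keeping the timeout as a separate status lets the calling code decide, for example, to retry a slow page later while skipping pages that failed for other reasons.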
Gatusko