I am parsing some URLs with ScrapingUtils, using the following code:
String link = "Here the link";
Document doc = ScrapingUtils.visit(link, false);
if (doc != null) {
    //code
} else {
    //code
}
The problem is that the request sometimes fails to retrieve the HTML, so no data can be extracted. I tried using try..catch so that, if a read timeout occurs, I can assign specific values to the variables to indicate that there was an error.
This is what I tried:
String link = "Here the link";
Document doc = ScrapingUtils.visit(link, false);
try {
    if (doc != null) {
        //code
    } else {
        //code
    }
} catch (TimeoutException exception) {
    throw new TimeoutException("Timeout exceeded: " + timeout + unit);
}
But the compiler reports an error on the catch (TimeoutException exception) clause:
TimeoutException exception is never thrown in body of corresponding try statement
I understand that Java rejects this catch clause because nothing inside the try block can throw that exception.
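To illustrate the rule behind that message: TimeoutException (java.util.concurrent) is a checked exception, and a checked exception can only be caught when something in the try body is declared to throw it. The sketch below is a minimal, self-contained illustration; fetch and fetchOrMarker are hypothetical names and are not part of ScrapingUtils.

```java
import java.util.concurrent.TimeoutException;

public class CheckedCatchDemo {

    // Hypothetical method: it declares the checked exception,
    // so a caller is allowed (and required) to handle it.
    static String fetch(boolean slow) throws TimeoutException {
        if (slow) {
            throw new TimeoutException("Timeout exceeded");
        }
        return "html";
    }

    // This catch compiles because fetch(...) declares TimeoutException.
    // If the try body only contained code that cannot throw it (for
    // example a null check), the compiler would reject the catch clause.
    static String fetchOrMarker(boolean slow) {
        try {
            return fetch(slow);
        } catch (TimeoutException ex) {
            return "TIMEOUT";
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchOrMarker(false)); // prints "html"
        System.out.println(fetchOrMarker(true));  // prints "TIMEOUT"
    }
}
```

In the code from the question, the try block only inspects doc, and ScrapingUtils.visit does not declare TimeoutException, which is why the catch is flagged.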
ScrapingUtils class:
public class ScrapingUtils {
    private static final Logger logger = LoggerFactory.getLogger(ScrapingUtils.class);

    public static Document visit(String urlStr, boolean useProxy) {
        Document doc = null;
        try {
            if (!useProxy) {
                logger.info("Downloading " + urlStr);
                doc = Jsoup.connect(urlStr).userAgent("Mozilla/5.0").maxBodySize(0).timeout(Config.CONNECTION_TIMEOUT).get();
            } else {
                logger.info("downloading " + urlStr);
                URL url = new URL(urlStr);
                String[] proxyStr = NetUtils.getProxy().split(":");
                Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyStr[0], Integer.parseInt(proxyStr[1])));
                HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
                conn.setConnectTimeout(Config.CONNECTION_TIMEOUT);
                conn.connect();
                BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                StringBuilder buffer = new StringBuilder();
                String str;
                while ((str = br.readLine()) != null) {
                    buffer.append(str);
                }
                doc = Jsoup.parse(buffer.toString());
            }
        } catch (IOException ex) {
            logger.error("Error downloading website " + urlStr + "\n" + ex.getMessage());
        }
        return doc;
    }

    public static Document visit(String urlStr) {
        return visit(urlStr, false);
    }
}
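Note that visit catches every IOException itself and returns null, so the caller cannot tell a timeout apart from any other download failure. A timeout from Jsoup's get() or from HttpURLConnection is reported as java.net.SocketTimeoutException, which is a subclass of IOException. The sketch below shows one possible way to separate the two cases; the Download interface, the Result enum, and classify are hypothetical stand-ins for the real download call, so the example runs without a network or Jsoup.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class TimeoutDemo {

    // Hypothetical outcome values the caller could store in its variables.
    enum Result { OK, TIMEOUT, IO_ERROR }

    // Hypothetical stand-in for the download step inside visit(...).
    interface Download {
        void run() throws IOException;
    }

    static Result classify(Download download) {
        try {
            download.run();
            return Result.OK;
        } catch (SocketTimeoutException ex) {
            // The more specific exception must be caught before IOException.
            return Result.TIMEOUT;
        } catch (IOException ex) {
            return Result.IO_ERROR;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> { }));
        System.out.println(classify(() -> { throw new SocketTimeoutException("Read timed out"); }));
        System.out.println(classify(() -> { throw new IOException("Connection reset"); }));
        // prints OK, TIMEOUT, IO_ERROR on separate lines
    }
}
```

Applied to ScrapingUtils, the same idea would mean either adding a catch (SocketTimeoutException ex) before the existing catch (IOException ex), or letting visit declare throws IOException so the caller can catch the timeout itself.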