I want to crawl this website using the Java Jsoup library.
My code is as follows:
import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

private String crawl() {
    Document doc = null;
    try {
        doc = Jsoup.connect(getUrl()).headers(getRequestHeaders()).get();
    } catch (Exception e) {
        e.printStackTrace();
    }
    // Guard against a failed fetch so a null doc doesn't throw an NPE here.
    return doc == null ? "" : doc.body().text();
}

private String getUrl() {
    return "https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?" +
            "amount=1&" +
            "fee=3&" +
            "fromCurr=IDR" +
            "&toCurr=USD" +
            "&submitButton=Calculate+exchange+rate";
}

private Map<String, String> getRequestHeaders() {
    Map<String, String> headers = new HashMap<>();
    headers.put("authority", "usa.visa.com");
    headers.put("cache-control", "max-age=0");
    headers.put("upgrade-insecure-requests", "1");
    headers.put("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36");
    headers.put("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
    headers.put("accept-encoding", "gzip, deflate, br");
    headers.put("accept-language", "en-US,en;q=0.9");
    return headers;
}
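For context, the crawl() method runs inside the Lambda handler roughly like this (a simplified sketch; the handler class name and the String input/output types are illustrative, not my exact setup):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// Simplified sketch of the Lambda entry point; crawl(), getUrl() and
// getRequestHeaders() are the methods shown above.
public class CrawlHandler implements RequestHandler<String, String> {
    @Override
    public String handleRequest(String input, Context context) {
        // The crawled page text is returned as the Lambda response payload.
        return crawl();
    }

    // ... crawl(), getUrl(), getRequestHeaders() as above ...
}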
Crawling locally works fine, but when I deploy the code to an AWS Lambda function, I get an access denied page:
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?" on this server.<P>
Reference #18.de174b17.1561156615.19dc81c4
</BODY>
</HTML>
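To see exactly what is coming back, the status code behind this page can be logged from the Lambda with something like the snippet below (a debugging sketch, not part of the deployed code; ignoreHttpErrors just stops Jsoup from throwing on non-2xx responses):

// requires: import org.jsoup.Connection;
// Debugging sketch: fetch the page without throwing on HTTP errors,
// so the status code (e.g. 403) and the raw body can be logged.
Connection.Response response = Jsoup.connect(getUrl())
        .headers(getRequestHeaders())
        .ignoreHttpErrors(true)   // return the response even for 4xx/5xx
        .execute();
System.out.println("Status: " + response.statusCode() + " " + response.statusMessage());
System.out.println(response.body());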
When I try to use curl locally with the following command, it gives me the same error.
curl 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?amount=1&fee=3&fromCurr=IDR&toCurr=USD&submitButton=Calculate+exchange+rate' \
  -H 'authority: usa.visa.com' \
  -H 'cache-control: max-age=0' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' \
  -H 'accept-encoding: gzip, deflate, br' \
  -H 'accept-language: en-US,en;q=0.9' \
  --compressed
I have also tried using cookies, following the answer here, but that still doesn't solve the issue.
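Roughly how the cookies were passed is shown below (the cookie name and value are placeholders, not the real ones I used):

// Sketch of passing cookies with the request; "SOME_COOKIE" / "value"
// are placeholders for the actual cookies I supplied.
Map<String, String> cookies = new HashMap<>();
cookies.put("SOME_COOKIE", "value");

Document doc = Jsoup.connect(getUrl())
        .headers(getRequestHeaders())
        .cookies(cookies)
        .get();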
I suspect the website has some kind of mechanism to protect it from being crawled. What can I do to bypass it?