I want to crawl this website using the Java jsoup library.

My code is as follows:

  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  private String crawl() {
    try {
      Document doc = Jsoup.connect(getUrl()).headers(getRequestHeaders()).get();
      return doc.body().text();
    } catch (IOException e) {
      // Log and return an empty result instead of letting doc.body()
      // throw a NullPointerException when the request fails.
      e.printStackTrace();
      return "";
    }
  }

  private String getUrl() {
    return "https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?" +
        "amount=1&" +
        "fee=3&" +
        "fromCurr=IDR" +
        "&toCurr=USD" +
        "&submitButton=Calculate+exchange+rate";
  }

  private Map<String, String> getRequestHeaders() {
    Map<String, String> headers = new HashMap<>();
    headers.put("authority", "usa.visa.com");
    headers.put("cache-control", "max-age=0");
    headers.put("upgrade-insecure-requests", "1");
    headers.put("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36");
    headers.put("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3");
    headers.put("accept-encoding", "gzip, deflate, br");
    headers.put("accept-language", "en-US,en;q=0.9");

    return headers;
  }
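
For context, these methods live in the Lambda handler class; the entry point is roughly the following (a simplified sketch; the class name and handler signature here are placeholders, not the exact deployed code):

  import com.amazonaws.services.lambda.runtime.Context;
  import com.amazonaws.services.lambda.runtime.RequestHandler;

  // Simplified sketch of the Lambda entry point; crawl(), getUrl() and
  // getRequestHeaders() above are methods of this class.
  public class ExchangeRateHandler implements RequestHandler<Void, String> {
    @Override
    public String handleRequest(Void input, Context context) {
      return crawl();
    }
  }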

If I crawl locally, it works fine. But when I deploy the code to an AWS Lambda function, I get an Access Denied page:

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;usa&#46;visa&#46;com&#47;support&#47;consumer&#47;travel&#45;support&#47;exchange&#45;rate&#45;calculator&#46;html&#63;" on this server.<P>
Reference&#32;&#35;18&#46;de174b17&#46;1561156615&#46;19dc81c4
</BODY>
</HTML>

When I try curl locally with the following command, it gives me the same error:

curl 'https://usa.visa.com/support/consumer/travel-support/exchange-rate-calculator.html?amount=1&fee=3&fromCurr=IDR&toCurr=USD&submitButton=Calculate+exchange+rate' -H 'authority: usa.visa.com' -H 'cache-control: max-age=0' -H 'upgrade-insecure-requests: 1' -H 'user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-US,en;q=0.9' --compressed

I have also tried using cookies, following the answer here, but that still doesn't solve the issue.
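
The cookie attempt looked roughly like this (a sketch from memory; the exact cookie handling followed the linked answer, and crawlWithCookies is just an illustrative name):

  import org.jsoup.Connection;

  private String crawlWithCookies() throws IOException {
    // Prime cookies by fetching the landing page first...
    Connection.Response landing = Jsoup.connect("https://usa.visa.com/")
        .headers(getRequestHeaders())
        .method(Connection.Method.GET)
        .execute();

    // ...then replay them on the calculator request.
    Document doc = Jsoup.connect(getUrl())
        .headers(getRequestHeaders())
        .cookies(landing.cookies())
        .get();

    return doc.body().text();
  }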

I suspect the website has some kind of mechanism to protect itself from being crawled. What can I do to bypass it?

edmundpie
  • If a site doesn't want to be crawled, don't force it to be crawled. – Benjamin Urquhart Jun 21 '19 at 23:39
  • If you want to access Visa exchange rates programmatically through an API, you can follow the documentation here: https://developer.visa.com/capabilities/foreign_exchange – Kannaiyan Jun 22 '19 at 03:10
  • @Kannaiyan it requires me to submit an official contract and company registration to be able to use the production API. – edmundpie Jun 22 '19 at 12:31
  • @edmundpie If you don't do that, you will pay a penalty for abuse, which will be higher than the contract price. Those requirements exist for good reasons; they are not meant to be hijacked. – Kannaiyan Jun 22 '19 at 16:03
  • Can you access other websites from the AWS cloud, maybe via another curl (see the sketch after these comments)? Possibility 1: you are not allowed to access external websites from AWS Lambda functions. Possibility 2: visa.com blocks access from AWS IP addresses. Check their regulations about scraping; they may have policies in place that you need to follow. If it is allowed but blocked anyway, maybe you should start contemplating proxy services. But really keep in mind that you need to check visa.com's policies on scraping their site. – luksch Jun 23 '19 at 13:11
  • According to https://usa.visa.com/robots.txt, crawling the URL is not disallowed. – edmundpie Jul 20 '19 at 19:29
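
Following up on luksch's first suggestion, a minimal connectivity check from inside the Lambda could look like this (a sketch only; the class name is a placeholder, and it assumes jsoup is already on the classpath):

  import org.jsoup.Jsoup;

  // Fetch a known-good external site from the Lambda to rule out a general
  // egress problem (e.g. a VPC subnet without a NAT gateway).
  public class ConnectivityCheck {
    public static void main(String[] args) throws Exception {
      String title = Jsoup.connect("https://example.com").get().title();
      System.out.println("Fetched title: " + title); // expect "Example Domain"
    }
  }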

0 Answers