While building a crawling server, I ran into a 403 error that occurs only in the deployed environment. The same code works normally in my local environment, and I have not been able to solve it.

I know you may be busy, but I would appreciate it if you could take a look and give me feedback.

I have been stuck on this for days.

Environment

  • GKE
  • Java, Spring Boot 3.0.7
  • Selenium

Error

[http-nio-8084-exec-1] [2023-08-05 23:40:28,741] [ERROR] [SolvedCrawling.java:136] - <html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>ERROR: The request could not be satisfied</title>
</head><body>
<h1>403 ERROR</h1>
<h2>The request could not be satisfied.</h2>
<hr noshade="" size="1px">
Request blocked.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
<br clear="all">
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
<br clear="all">
<hr noshade="" size="1px">
<pre>Generated by cloudfront (CloudFront)
Request ID: -ok_4ED_rixCpeCLsp6ytvEtEjyMwZSMUb2VxVa10USNMDizAGzXbg==
</pre>
<address>
</address>
</body></html>

Code

  • RestTemplate
@Bean
public RestTemplate restTemplate() {
    RestTemplate restTemplate;
    try {
        restTemplate = new RestTemplate(clientHttpRequestFactory());
    } catch (Exception e) {
        // Fall back to the default request factory if the HttpComponents one cannot be created.
        restTemplate = new RestTemplate();
    }
    restTemplate.setInterceptors(Collections.singletonList(
            (request, body, execution) -> {
                HttpHeaders headers = request.getHeaders();
                headers.setContentType(APPLICATION_JSON);
                // setAccept already writes the Accept header, so it is not added a second time.
                headers.setAccept(Collections.singletonList(APPLICATION_JSON));
                // set (rather than add) avoids sending a duplicate User-Agent header.
                headers.set(HttpHeaders.USER_AGENT, "Mozilla/5.0");
                return execution.execute(request, body);
            }
    ));
    return restTemplate;
}

private HttpComponentsClientHttpRequestFactory clientHttpRequestFactory() {
    return new HttpComponentsClientHttpRequestFactory();
}
  • API Request
public String getSubject(int problemId) throws Exception{
    String jsonString = null;

    try {
        jsonString = restTemplate.getForObject("https://solved.ac/api/v3/problem/show?problemId=" + problemId, String.class);
    }catch (Exception e){
        e.printStackTrace();
        throw new HttpResponseException("fail.");
    }

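    JSONParser jsonParser = new JSONParser();
    Object jsonObject;
    try {
        jsonObject = jsonParser.parse(jsonString);
    } catch (ParseException e) {
        // Fail here instead of continuing with a null object, which would cause a NullPointerException below.
        throw new IllegalStateException("Unexpected response body for problem " + problemId, e);
    }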

    JSONObject jsonBody = (JSONObject) jsonObject;

    return jsonBody.get("titleKo").toString();
}
  • Crawling
public BaekJoonDto profileCrawling(String baekjoonId) throws IOException, InterruptedException {
    WebDriver driver = setDriver();
    try {
        sleep(1000);
        driver.get(SOLVED_BASE_URL + SOLVED_PROFILE + baekjoonId);

        By solvedListBy = By.xpath("//*[@id=\"__next\"]/div[3]/div/div[6]/div[3]/div/table/tbody");
        sleep(1000);
        try {
            wait(driver, solvedListBy);
        } catch (TimeoutException | NoSuchElementException e) {
            log.error("{}", driver.getCurrentUrl());
            log.error("{}", driver.getPageSource());
            throw new CrawlingException("User NotFound.");
        }
        WebElement elements = driver.findElement(solvedListBy);
        WebElement webElement = elements.findElement(By.className("css-1ojb0xa"));
        int bronze = getUserSolvedCount(webElement, By.xpath("//*[@id=\"__next\"]/div[3]/div/div[6]/div[3]/div/table/tbody/tr[1]/td[2]/b"));

        return new BaekJoonDto(bronze);
    } finally {
        // Always close the browser, including when the wait above fails.
        driver.quit();
    }
}
  • Driver setting
private WebDriver setDriver() throws IOException, InterruptedException {
        String os = System.getProperty("os.name").toLowerCase();

        if (os.contains("win")) {
            System.setProperty("webdriver.chrome.driver", "drivers/chromedriver_win.exe");
        } else if (os.contains("mac")) {
            Process process = Runtime.getRuntime().exec("xattr -d com.apple.quarantine drivers/chromedriver_mac");
            process.waitFor();
            System.setProperty("webdriver.chrome.driver", "drivers/chromedriver_mac");
        } else if (os.contains("linux")) {
            System.setProperty("webdriver.chrome.driver", "/usr/bin/chromedriver-linux64/chromedriver");
        }

        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.addArguments("--disk-cache-size=0");
        chromeOptions.addArguments("--media-cache-size=0");
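        // --headless=new already enables headless mode; the separate setHeadless(true) call
        // was redundant and is deprecated in newer Selenium 4 releases.
        chromeOptions.addArguments("--headless=new");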
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--disable-dev-shm-usage");
        chromeOptions.addArguments("--disable-gpu");
        chromeOptions.addArguments("--remote-allow-origins=*");
        chromeOptions.addArguments("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537");
        // Binary setting for the local environment:
        // chromeOptions.setBinary("/Applications/Google Chrome.app/Contents/MacOS/Google Chrome");
        // Binary setting for the deployed environment:
        chromeOptions.setBinary("/usr/bin/google-chrome");
        return new ChromeDriver(chromeOptions);
    }
  1. user-agent setting
박우영

1 Answer

This CloudFront 403 error message implies that the request did not match the conditions configured for the distribution, so it was blocked by AWS WAF.


Details

With a CloudFront 403 error, the response contains a message similar to: "Request blocked. We can't connect to the server for this app or website at this time." The Server response header has CloudFront as its value. The same error message and the same Server header value can also appear when something other than AWS WAF blocked the request. To confirm that AWS WAF blocked the request and to identify the rule responsible, check the AWS WAF logs for the blocked request. You can also check the AWS WAF CloudFront metrics for the relevant web ACL, then review the web ACL to see which rules are blocking requests.

In short, the AWS WAF protection is detecting your program as a bot.
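
The Server-header check described above can be done from the client side, which is about all a caller without access to the site's WAF logs can verify. Below is a minimal sketch using only the JDK's java.net.http client. The class name CloudFrontBlockCheck and both method names are hypothetical, and the extra X-Cache test ("Error from cloudfront") is an assumption about what CloudFront typically sends alongside such an error, not something stated in the question.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class CloudFrontBlockCheck {

    /**
     * Hypothetical helper: returns true when the status is 403 and the headers
     * indicate the response was generated by CloudFront itself.
     */
    public static boolean looksLikeCloudFrontBlock(int status, Map<String, List<String>> headers) {
        if (status != 403) {
            return false;
        }
        for (Map.Entry<String, List<String>> header : headers.entrySet()) {
            // Header names are case-insensitive, so compare without regard to case.
            boolean serverHeader = header.getKey().equalsIgnoreCase("server");
            boolean cacheHeader = header.getKey().equalsIgnoreCase("x-cache");
            for (String value : header.getValue()) {
                String lower = value.toLowerCase(Locale.ROOT);
                if (serverHeader && lower.contains("cloudfront")) {
                    return true;
                }
                // Assumption: CloudFront marks its own error pages in X-Cache.
                if (cacheHeader && lower.contains("error from cloudfront")) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Sends one GET request and applies the check to the real response. */
    public static boolean probe(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return looksLikeCloudFrontBlock(response.statusCode(), response.headers().map());
    }

    public static void main(String[] args) {
        // Offline example that mirrors the error page shown in the question.
        Map<String, List<String>> sample = Map.of("Server", List.of("CloudFront"));
        System.out.println(looksLikeCloudFrontBlock(403, sample)); // prints "true"
        System.out.println(looksLikeCloudFrontBlock(200, sample)); // prints "false"
    }
}
```

Calling probe with the same URL from the local machine and from a pod in the cluster would show whether only the deployed environment receives the CloudFront-generated 403. It cannot show which rule blocked the request; that information is only in the site owner's WAF logs.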


Solution

You can apply a few tweaks to keep the Chrome browser started by ChromeDriver from being detected as a bot.
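
As one illustration of such tweaks, here is a minimal sketch that builds a list of Chrome command-line arguments commonly used to reduce obvious automation signals. The class StealthChromeArguments and its build method are hypothetical names. The flag --disable-blink-features=AutomationControlled is a real Chrome switch that is commonly used to keep navigator.webdriver from reporting true. Whether these arguments are enough for this particular site is an assumption, since a WAF can also decide based on the source IP address or other request properties.

```java
import java.util.ArrayList;
import java.util.List;

public class StealthChromeArguments {

    /**
     * Hypothetical helper: returns Chrome arguments for a headless session
     * with fewer automation fingerprints. The user agent is optional.
     */
    public static List<String> build(String userAgent) {
        List<String> arguments = new ArrayList<>();
        arguments.add("--headless=new");
        arguments.add("--no-sandbox");
        arguments.add("--disable-dev-shm-usage");
        // Commonly used so that navigator.webdriver does not report true.
        arguments.add("--disable-blink-features=AutomationControlled");
        // Headless Chrome starts with a small default window; use a desktop-like size.
        arguments.add("--window-size=1920,1080");
        if (userAgent != null && !userAgent.isBlank()) {
            // Only override the user agent when one is supplied.
            arguments.add("--user-agent=" + userAgent);
        }
        return arguments;
    }

    public static void main(String[] args) {
        // Print the arguments that would be handed to the browser.
        for (String argument : build("Mozilla/5.0")) {
            System.out.println(argument);
        }
    }
}
```

In the question's setDriver method, the list could be passed with chromeOptions.addArguments(StealthChromeArguments.build(userAgent)), since ChromeOptions accepts a List<String>. A user agent that matches the Chrome version actually installed in the container is probably a safer choice than a string for a much older release.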

undetected Selenium
  • Thank you for the comment. It works fine on AWS EC2 and in the local environment, but I don't know why it doesn't work on GKE. – 박우영 Aug 06 '23 at 06:58