I'm struggling to scrape a website that is protected by a captcha (example of the page). I discovered that when using Selenium ChromeDriver, the captcha has to be entered only once, and after that I can load pages for as long as I like without getting another captcha. But scraping data through Selenium is very slow and generally a real pain to use, so I tried another approach: I load any page in Selenium once, enter the captcha, and save the Chrome cookies using

Set<Cookie> cookies = chromeDriver.manage().getCookies();

After that, I pass this set of cookies to my request builder method:

// Requires: import java.util.stream.Collectors;
private Request buildRequest(String url, Set<Cookie> cookies) {
        // Join the Selenium cookies into one "name=value; name=value" header value.
        // Collectors.joining needs no trailing-separator trim, so an empty set
        // no longer makes substring(0, -2) throw.
        String cookieHeader = cookies.stream()
                .map(cookie -> cookie.getName() + "=" + cookie.getValue())
                .collect(Collectors.joining("; "));
        return new Request.Builder()
                .url(url)
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36")
                .header("Cookie", cookieHeader)
                .build();
    }

And then I execute this request via an OkHttp3 client:

private Response getResponse(Request request) throws IOException {
        return client.newCall(request).execute();
    }

If no cookies, or malformed cookies, are passed, the request immediately gets a captcha response, so I can tell that this method works up to a point. However, after executing some number of requests I get the captcha again, and if I reload the page in ChromeDriver, the captcha appears there too. I couldn't discover the pattern that triggers the captcha: the number of requests and the time between the first and last request are different every time. I've tried adding delays between requests, which doesn't help. I've also tried different combinations of headers besides cookies in the request. I've tried obtaining 100 valid cookies from different Chrome windows and iterating through them, but they all hit the captcha soon enough. I've tried to debug ChromeDriver's internal OkHttp calls to copy its logic, but it seems it doesn't make requests to the website directly, and the logic is well hidden.

Am I missing something? Is there a way to improve my Request object, so I always get response without captcha?

OzzyFromOz
  • sounds like you are getting cookies tied to a session. This session may timeout... normally you'd be passed new ones, but you are still using the old ones so captcha appears. (plus new ones?... not familiar with okhttp or how/if it sets/gets cookies) Check cookie expiry time. – pcalkins Nov 09 '20 at 20:01

1 Answer

You can use a cookie store (in OkHttp, a CookieJar) in case they are updating your cookies during the session. This replaces setting the cookies manually as a header, and it adapts to any Set-Cookie response headers.

https://stackoverflow.com/a/35346473/1542667
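
As a rough illustration of the idea, here is a minimal sketch of a cookie store that is seeded once and then keeps itself up to date from Set-Cookie headers. It uses the JDK's java.net.CookieManager rather than OkHttp's own API so that it runs without any dependencies; in a real OkHttp client you would plug equivalent logic in through the CookieJar interface. The class name CookieStoreSketch, the methods seed, saveFromResponse and cookieHeaderFor, and the example.com URLs are all hypothetical names made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a cookie store that follows Set-Cookie updates.
final class CookieStoreSketch {
    // null store = the JDK's default in-memory cookie store.
    private final CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

    // Seed the store with a cookie captured elsewhere (e.g. from Selenium).
    // Assumption: the cookie applies to the whole site, hence Path=/.
    void seed(URI uri, String name, String value) {
        saveFromResponse(uri, Collections.singletonMap(
                "Set-Cookie", Collections.singletonList(name + "=" + value + "; Path=/")));
    }

    // Record any Set-Cookie headers that came back with a response.
    void saveFromResponse(URI uri, Map<String, List<String>> responseHeaders) {
        try {
            manager.put(uri, responseHeaders);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build the Cookie header value to send with the next request to this URI.
    String cookieHeaderFor(URI uri) {
        try {
            Map<String, List<String>> headers =
                    manager.get(uri, Collections.<String, List<String>>emptyMap());
            List<String> cookies = headers.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieStoreSketch store = new CookieStoreSketch();
        URI first = URI.create("https://example.com/page/1");
        store.seed(first, "session", "abc");
        // Pretend the server rotated the session cookie in its response.
        store.saveFromResponse(first, Collections.singletonMap(
                "Set-Cookie", Collections.singletonList("session=xyz; Path=/")));
        System.out.println(store.cookieHeaderFor(URI.create("https://example.com/page/2")));
    }
}
```

The point of the design is that the request code never hard-codes a cookie string: it asks the store before each request and feeds every response back in, so a rotated session cookie is picked up automatically.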

But they may also be using other signals, such as incoming request rates that don't look like a human using the site.

You are in a cat-and-mouse game against a site that, quite rightly, doesn't want you scraping its content.

FWIW, I hope they win the battle; you're the "bad actor" in this scenario. Anyway, good luck.

Yuri Schimke