I'm struggling with scraping a website that is protected by a captcha (example of the page). What I discovered is that when using Selenium with ChromeDriver, the captcha has to be entered only once, and after that I can load pages for as long as I like without getting a captcha. But scraping data through Selenium is very slow and a real pain to use in general, so I've tried another approach. I load any page in Selenium only once, enter the captcha, and save the Chrome cookies using:
Set<Cookie> cookies = chromeDriver.manage().getCookies();
After that, I pass this set of cookies to my request builder method:
private Request buildRequest(String url, Set<Cookie> cookies) {
    // Join the cookies as "name=value" pairs separated by "; ".
    // Adding the separator before each pair except the first avoids trimming
    // a trailing "; ", which threw an exception when the set was empty.
    StringBuilder cookieSb = new StringBuilder();
    for (Cookie cookie : cookies) {
        if (cookieSb.length() > 0) {
            cookieSb.append("; ");
        }
        cookieSb.append(cookie.getName()).append('=').append(cookie.getValue());
    }
    String cookieHeader = cookieSb.toString();
    return new Request.Builder()
            .url(url)
            .header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36")
            .header("Cookie", cookieHeader)
            .build();
}
And then I execute this request via the OkHttp3 client:
private Response getResponse(Request request) throws IOException {
    return client.newCall(request).execute();
}
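To rule out the cookie-joining step as the source of the problem, it can be checked in isolation, without Selenium or OkHttp. This is only a minimal sketch: the class `CookieHeaderSketch`, the method `toCookieHeader`, and the cookie names and values are made up for illustration, and a plain `Map<String, String>` stands in for the `Set<Cookie>` from Selenium.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Selenium or OkHttp.
final class CookieHeaderSketch {

    // Builds a Cookie header value: "name=value" pairs joined with "; ".
    // Collectors.joining adds the separator only between elements, so an
    // empty map yields an empty string and nothing needs to be trimmed.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Made-up cookie names and values; LinkedHashMap keeps insertion order.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session", "abc123");
        cookies.put("captcha_ok", "1");
        System.out.println(toCookieHeader(cookies)); // prints "session=abc123; captcha_ok=1"
    }
}
```

With the real `Set<Cookie>`, the same joining would be applied to `cookie.getName()` and `cookie.getValue()`. The header format looks right to me, so I assume the problem is not in how the header string is assembled.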
If no cookies are passed, or the cookies contain errors, such a request immediately gets a captcha response, so I can tell that this method works up to a point. However, after executing some number of requests I get the captcha again, and if I then reload the page in ChromeDriver, the captcha shows up there too. I couldn't discover the pattern by which the captcha is shown: the number of requests and the time between the first and the last request are different every time. I've tried adding delays between requests, but it doesn't help. I've also tried different combinations of headers besides the cookies in the request. I've tried obtaining 100 valid cookie sets from different Chrome windows and iterating through them, but they all get the captcha soon enough. I've tried to debug ChromeDriver's internal OkHttp calls in order to copy its logic, but it seems that it doesn't make requests to the website directly, and the actual logic is well hidden.
Am I missing something? Is there a way to improve my Request object so that I always get a response without a captcha?