Some hints to improve your scraping:
1. Use proxies
Proxies reduce the chance of being caught by a captcha. You should use between 50 and 150 proxies, depending on your average result set. Here are two websites that can provide proxies: SEO-proxies.com or Proxify Switch Proxy.
import java.net.InetSocketAddress;
import java.net.Proxy;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Setup proxy
String proxyAddress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAddress, proxyPort));
// Fetch url with proxy
Document doc = Jsoup //
        .connect(searchUrl) // connect() comes first: it creates the Connection the other calls configure
        .proxy(proxy) //
        .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
        .header("Content-Language", "en-US") //
        .get();
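With a pool of 50 to 150 proxies you also need a way to pick one for each request. Here is a minimal sketch, assuming a plain round-robin rotation over a fixed list of "host:port" strings. `ProxyRotator` is a hypothetical helper, not part of Jsoup, and it only uses `java.net` classes:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: hands out proxies from a fixed pool in round-robin order. */
public class ProxyRotator {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    /** Each entry is expected to look like "host:port", e.g. "1.2.3.4:1234". */
    public ProxyRotator(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("At least one proxy is required");
        }
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        }
    }

    /** Returns the next proxy, wrapping around at the end of the pool. Safe to call from several threads. */
    public Proxy next() {
        // floorMod keeps the index valid even if the counter overflows to a negative value
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each call to `next()` returns a `Proxy` that you can pass to `.proxy(...)` when fetching a URL. Round-robin is only one possible policy; you may prefer to drop proxies that start hitting captchas.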
2. Captchas
If you do get caught by a captcha, you can use an online captcha solving service (Bypass Captcha, DeathByCaptcha, to name a few). Below is a generic step-by-step procedure to get the captcha solved automatically:
- Detect captcha error page
--
try {
    // Perform search here...
} catch (HttpStatusException e) {
    switch (e.getStatusCode()) {
        case java.net.HttpURLConnection.HTTP_UNAVAILABLE:
            if (e.getUrl().contains("http://ipv4.google.com/sorry/IndexRedirect?continue=http://google.com/search/...")) {
                // Ask online captcha service for help...
            } else {
                // ...
            }
            break;
        default:
            // ...
    }
}
- Download the captcha image (CI)
--
byte[] captchaImage = Jsoup //
        .connect(imageCaptchaUrl) //
        //.cookie(..., ...) // Some cookies may be needed...
        .ignoreContentType(true) // Needed for fetching image
        .execute() //
        .bodyAsBytes(); // byte[] array returned...
- Send CI to the online captcha service
--
This part depends on the captcha service API. You can find some services in this 8 best captcha solving services article.
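Since every service has its own API, one option is to hide it behind a small interface so the rest of your scraper does not care which service you picked. The sketch below is only an illustration: `CaptchaSolver`, `CaptchaFlow` and `solveWithRetry` are hypothetical names, and the actual call to the service is left to your implementation of `solve`.

```java
import java.util.Optional;

public class CaptchaFlow {

    /** Hypothetical abstraction; a real implementation would call the API of the service you chose. */
    public interface CaptchaSolver {
        /** Returns the recognized text, or empty if the service could not solve the image. */
        Optional<String> solve(byte[] captchaImage) throws Exception;
    }

    /** Asks the solver up to maxAttempts times and returns the first answer it gives. */
    public static Optional<String> solveWithRetry(CaptchaSolver solver, byte[] captchaImage, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Optional<String> answer = solver.solve(captchaImage);
                if (answer.isPresent()) {
                    return answer;
                }
            } catch (Exception e) {
                // A failed call counts as a failed attempt; fall through and try again
            }
        }
        return Optional.empty();
    }
}
```

You would pass the `byte[]` downloaded in the previous step as `captchaImage`, then submit the returned text back to the captcha page.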
3. Some other hints
The Hints for Google scrapers article can give you some more pointers for improving your code. You'll find the first two hints presented here plus some more:
- Cookies: clear them on each IP change or don't use them at all
- Threads: You should not open too many connections. Firefox limits itself to 4 connections per proxy.
- Returned results: append &num=100 to your URL to send fewer requests
- Request rates: Make your requests look human. You should not send more than 500 requests per 24h per IP.
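To put numbers on the last two hints: 500 requests per 24h works out to one request every 172.8 seconds on average per IP. Here is a rough sketch of how you could pace requests with some random jitter so they are not evenly spaced, and how `num=100` can be appended. `RequestPacer` and its method names are hypothetical, and the jitter range (50% to 150% of the average) is an arbitrary assumption:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for pacing requests sent through a single IP. */
public class RequestPacer {
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    /** Average gap between requests that keeps one IP within its daily budget. */
    public static long averageDelayMillis(int maxRequestsPerDay) {
        return MILLIS_PER_DAY / maxRequestsPerDay;
    }

    /** Random delay between 50% (inclusive) and 150% (exclusive) of the average gap. */
    public static long nextDelayMillis(int maxRequestsPerDay) {
        long average = averageDelayMillis(maxRequestsPerDay);
        return ThreadLocalRandom.current().nextLong(average / 2, average + average / 2);
    }

    /** Appends num=100, using '&' if the URL already has a query string and '?' otherwise. */
    public static String withMaxResults(String searchUrl) {
        return searchUrl + (searchUrl.contains("?") ? "&" : "?") + "num=100";
    }
}
```

Between two requests on the same proxy you would then call `Thread.sleep(RequestPacer.nextDelayMillis(500))`. With many proxies the waits overlap, so the overall throughput stays reasonable.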
References :