
I'm currently building a web spider in Java with Apache Commons. I'm crawling basic Google search queries such as https://google.com/search?q=word&hl=en

After about 60 queries I get blocked: Google seems to recognize me as a bot, and I get a 503 Service Unavailable response.

Now the important part: if I visit the same URL with Firefox or Chrome, I get the desired result. If I make a GET request from my application using the same HTTP headers (User-Agent, cookies, cache, etc.), I am still blocked.

How does Google know whether I'm connecting via my application or via Chrome, when the only information available is the IP address and the HTTP headers? (Maybe I'm wrong?) Are there other parameters that identify my app, something that Google sees and I don't?

(Maybe important: I'm using Chrome Developer Tools and httpbin.org to compare the headers sent by the browser and by the application.)
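For reference, the header comparison described above could look roughly like the sketch below. It is only an illustration: it uses the JDK's built-in `java.net.http` client (Java 11+) rather than the Apache library mentioned in the question, the class name `HeaderEcho` and helper `buildRequest` are made up, and the header values are placeholders to be replaced with whatever Chrome Developer Tools shows. It assumes that https://httpbin.org/headers echoes back the request headers it received.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

public class HeaderEcho {

    // Hypothetical helper: builds a GET request carrying the given headers.
    static HttpRequest buildRequest(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        headers.forEach(builder::header);
        return builder.build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder values: copy the real ones from the browser's network tab.
        Map<String, String> browserHeaders = Map.of(
                "User-Agent", "Mozilla/5.0 (placeholder; paste the browser value here)",
                "Accept-Language", "en-US,en;q=0.9");

        HttpRequest request = buildRequest("https://httpbin.org/headers", browserHeaders);

        // Send the request and print the headers as the server saw them.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

The printed body can then be compared line by line with what the browser sends to the same URL.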

Thanks a lot

Schnurbert

1 Answer


Since you have not specified how quickly you send the 60 queries, I am assuming you send them at a high rate, and that is why Google is blocking you. Several times I have run Google searches in rapid succession from Chrome; after a while it asks for a CAPTCHA and blocks me soon after.

Please see the Custom Search API, and this post about the terms of service: Replacement for Google API

FAQ on blocked searches: Google FAQ

chongo2002
  • Thanks, that's probably the reason why I was blocked initially. Still, if I make a single request from my app afterwards, I still get blocked, while using the browser works fine. How does Google distinguish between the two? – Schnurbert Nov 14 '17 at 18:30
  • And how long should I wait between single requests? I tried 5+random(5) seconds earlier but got blocked anyway. – Schnurbert Nov 14 '17 at 18:32
  • Added some reference links to the answer – chongo2002 Nov 15 '17 at 00:34
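The `5+random(5)` seconds delay from the comments, combined with backing off after a 503, might be sketched as follows. This is a rough illustration only: the class name `PoliteDelay`, both helper methods, and all numeric values (one minute initial backoff, one hour cap) are assumptions and not limits documented by Google, and the comment above already reports that a 5 to 10 second delay alone did not prevent the block. The status codes are simulated so the example runs without any network access.

```java
import java.util.Random;

public class PoliteDelay {

    // Base delay plus uniform jitter: baseSeconds + random(0..jitterSeconds), in milliseconds.
    static long jitteredDelayMillis(long baseSeconds, int jitterSeconds, Random rng) {
        return baseSeconds * 1000L + rng.nextInt(jitterSeconds * 1000 + 1);
    }

    // Exponential backoff: initialMillis doubled once per retry, never above capMillis.
    static long backoffMillis(long initialMillis, int retryIndex, long capMillis) {
        long delay = initialMillis;
        for (int i = 0; i < retryIndex && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int consecutive503 = 0;
        // Simulated responses; a real crawler would use the actual HTTP status here.
        int[] simulatedStatuses = {200, 200, 503, 503, 200};

        for (int status : simulatedStatuses) {
            long waitMillis;
            if (status == 503) {
                // Blocked: wait much longer before trying again (assumed values).
                waitMillis = backoffMillis(60_000L, consecutive503, 3_600_000L);
                consecutive503++;
            } else {
                consecutive503 = 0;
                waitMillis = jitteredDelayMillis(5, 5, rng);
            }
            System.out.println("status " + status + " -> wait " + waitMillis + " ms");
            // A real crawler would call Thread.sleep(waitMillis) here.
        }
    }
}
```

The jitter keeps requests from arriving at a fixed interval, and the backoff stops the crawler from retrying immediately while it is blocked. Neither is guaranteed to avoid the 503.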