
I have data that I need to modify using the first result of a Google search. The search has to be repeated about 300,000 times (once per row), with varying search keywords.

I wrote a bash script for this using wget. However, after about 30 sequential requests, my queries seem to get blocked:

Connecting to www.google.com (www.google.com)|74.125.24.103|:80... connected. HTTP request sent, awaiting response... 404 Not Found

ERROR 404: Not Found.

I am using this snippet:

wget -qO- --limit-rate=20k --user-agent='Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0' "http://www.google.de/search?q=wikipedia%20$encodedString"

I depend on this working, so I hope someone has experience with it. It is not a recurring job and does not need to be done quickly; it would even be acceptable if the 300,000 requests took over a week.
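For reference, here is a minimal sketch of the kind of loop described above; the keyword file name, the jq-based URL encoding, and the two-second pause are assumptions on my part, not part of the original script:

#!/bin/bash
# Minimal sketch: run one Google search per keyword, spread out over time.
# keywords.txt (one keyword per line) and the jq dependency are assumptions.
while IFS= read -r keyword; do
    # URL-encode the keyword (jq's @uri filter is one common way to do this).
    encodedString=$(printf '%s' "$keyword" | jq -sRr @uri)
    wget -qO- --limit-rate=20k \
        --user-agent='Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0' \
        "http://www.google.de/search?q=wikipedia%20$encodedString" >> results.html
    # 300,000 requests spread over one week is roughly one request every two seconds.
    sleep 2
done < keywords.txt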

nauti

1 Answer


Google won't let you do this; it has a rather advanced set of heuristics to detect "non-human" usage. If you want to do something automated with Google, it kind of forces you to use their API.

Short of distributing your queries over a very large set of clients (given that you have 3*10^5 queries and get blocked after roughly 3*10^1, you'd need on the order of 10,000 of them), which is neither feasible nor sensible, you'll need to use an API that is actually meant for automation.

Luckily, Google offers a JSON API, which is far easier to parse from a script, so have a look at https://stackoverflow.com/a/3727777/4433386 .
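As a rough illustration of why a JSON response is easier to handle from a script, the sketch below queries what is now the Custom Search JSON API and pulls out the first result's URL; the key, the search-engine ID, and the response field path are assumptions, so check the linked answer and the current documentation before relying on them:

# Rough, untested sketch: one query against the Custom Search JSON API,
# extracting the first result's URL with jq. The API key, search-engine ID (cx)
# and the .items[0].link field path are assumptions; consult the current docs.
API_KEY="YOUR_API_KEY"            # placeholder
CX="YOUR_SEARCH_ENGINE_ID"        # placeholder
query="wikipedia%20$encodedString"
wget -qO- "https://www.googleapis.com/customsearch/v1?key=$API_KEY&cx=$CX&q=$query" \
    | jq -r '.items[0].link'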

Marcus Müller
  • Unfortunately the Google Web Search API is deprecated. "Its last day of operation will be September 29, 2014." Also, parsing is not an issue. – nauti Mar 05 '15 at 01:30
  • "Not an issue" for you perhaps, but poor design. – tripleee Mar 05 '15 at 09:37
  • I'd second @tripleee on that: if you can get something in a reliably well-defined data format, having to parse HTML instead is a really, really bad choice. – Marcus Müller Mar 05 '15 at 10:18