
The Google Web Search APIs appear to be dead (both the old SOAP and the newer AJAX). Is there a quick way to search Google for a string and return the number of results? I assume I just have to run the search and scrape the results, but I'd love to know if there's a better way.

Update: It turns out that any automated access to Google that doesn't use their new API https://developers.google.com/custom-search/json-api/v1/overview violates their terms of service, and is thus not recommended.
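For completeness, here is a minimal sketch of querying that Custom Search JSON API instead of scraping. This is only a sketch: `api_key` and `cx` (the Programmable Search Engine ID) are placeholders you must obtain from your own Google Cloud setup; the estimated hit count comes back in the response's `searchInformation.totalResults` field.

```python
import requests

API_BASE = 'https://www.googleapis.com/customsearch/v1'

def build_params(api_key, cx, query):
    """Assemble the query parameters for a Custom Search request."""
    return {'key': api_key, 'cx': cx, 'q': query}

def result_count(api_key, cx, query):
    """Return Google's estimated hit count for `query` (makes a network call)."""
    r = requests.get(API_BASE, params=build_params(api_key, cx, query))
    r.raise_for_status()
    return int(r.json()['searchInformation']['totalResults'])

# Usage (with real credentials):
#   result_count('YOUR_API_KEY', 'YOUR_CX', '"alias smith and jones"')
```

Note the free tier of this API is limited to a small number of queries per day, so it suits spot checks rather than bulk counting.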

PurpleVermont

1 Answer


There is still no free API, but here is a screen-scraper:

import argparse

import requests
from bs4 import BeautifulSoup

parser = argparse.ArgumentParser(description='Get Google Count.')
parser.add_argument('word', help='word to count')
args = parser.parse_args()

r = requests.get('http://www.google.com/search',
                 params={'q': '"' + args.word + '"',
                         'tbs': 'li:1'})  # li:1 asks for verbatim results

soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly
print(soup.find('div', {'id': 'resultStats'}).text)

Results:

$ python g.py jones
About 223,000,000 results
$ python g.py smith
About 325,000,000 results
$ python g.py 'smith and jones'
About 54,200,000 results
$ python g.py 'alias smith and jones'
About 181,000 results
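One caveat (raised in the comments below): when a quoted phrase has no hits, Google silently reports the count for the unquoted query instead. A minimal sketch for detecting that fallback by checking the returned page for Google's "No results found for" notice; the regex assumes the same 2015-era markup (`id="resultStats"`) as the scraper above:

```python
import re

NO_RESULTS = 'No results found for'

def parse_count(html_text):
    """Return the result-count text, or None when Google fell back to
    an unquoted search (or the stats div is missing)."""
    if NO_RESULTS in html_text:
        return None
    m = re.search(r'<div id="resultStats">([^<]*)</div>', html_text)
    return m.group(1) if m else None
```

Feed it `r.text` from the request above; a `None` result means the exact-phrase count is not trustworthy.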
Robᵩ
  • Oddly I'm getting a 404 error when I try this, even though I can load the search URL just fine in my browser: – PurpleVermont Apr 02 '15 at 00:04
  • `404 Not Found — The requested URL /search was not found on this server. Apache/2.2.3 (Red Hat) Server at www.google.com Port 80` – PurpleVermont Apr 02 '15 at 00:04
  • There is [a proxy], but if I'm getting a 404, I'm getting through the proxy. There's a different error I get when I don't get through the proxy. – PurpleVermont Apr 02 '15 at 15:41
  • The thing is, that's not Google's 404 page. Their 404 page has Google branding, the phrase "That's all we know.", and doesn't mention Apache or Red Hat. I'm afraid I can't help you further, except to point a finger at your proxy setup. – Robᵩ Apr 02 '15 at 16:56
  • I did notice that it's different than Google's standard 404 page. – PurpleVermont Apr 02 '15 at 17:41
  • Indeed it was proxy-related, and this answer helped: http://stackoverflow.com/questions/8287628/proxies-with-python-requests-module – PurpleVermont Apr 02 '15 at 17:50
  • This now works for me, except that I'm searching for a quoted string, and if it gets no hits, it's giving me the number of hits for the unquoted version, which I do not want. Is there another parameter I can add to the google search url to tell it not to do that? – PurpleVermont Apr 02 '15 at 18:02
  • Alternatively, I guess I could just scrape the results for the `No results found for "xyzzy and lmnop"` line in the page returned – PurpleVermont Apr 02 '15 at 18:09
  • That params line doesn't help. – PurpleVermont Apr 02 '15 at 19:10
  • also any hints on appropriate delays to avoid being captcha blocked? – PurpleVermont Apr 02 '15 at 19:20
  • Nope, I can't help you on captcha. – Robᵩ Apr 02 '15 at 19:28
  • As of Oct 2022 the div `id` changed so it is now `soup.find('div', {'id':'result-stats'}).text`. – Teque5 Oct 27 '22 at 16:04
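Pulling the comment threads together, a sketch that tolerates both the old and new div ids and pauses between requests. The `fetch` callable and the 30-second delay are assumptions of mine: `fetch` stands in for whatever performs the HTTP request, and Google publishes no safe rate, so the delay is only a guess.

```python
import re
import time

STATS_IDS = ('resultStats', 'result-stats')  # pre- and post-2022 markup

def find_stats(html_text):
    """Return the result-count text under either known div id, or None."""
    for div_id in STATS_IDS:
        m = re.search(r'<div id="%s">([^<]*)</div>' % div_id, html_text)
        if m:
            return m.group(1)
    return None

def polite_counts(fetch, words, delay=30):
    """Look up each word via `fetch(word)` (a hypothetical callable that
    returns the results page HTML), sleeping `delay` seconds between
    requests to reduce the chance of a CAPTCHA block."""
    counts = {}
    for i, word in enumerate(words):
        if i:
            time.sleep(delay)
        counts[word] = find_stats(fetch(word))
    return counts
```

Even with generous delays, Google may still serve a CAPTCHA; there is no documented threshold below which scraping is safe.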