
I am writing a web scraper to extract the number of results of a Google search, which appears at the top left of the results page. I have written the code below, but I do not understand why phrase_extract is None. I want to extract the phrase "About 12,010,000,000 results". Which part am I getting wrong? Maybe I am parsing the HTML incorrectly?

import requests
from bs4 import BeautifulSoup

def pyGoogleSearch(word):   
    address='http://www.google.com/#q='
    newword=address+word
    #webbrowser.open(newword)
    page=requests.get(newword)
    soup = BeautifulSoup(page.content, 'html.parser')
    phrase_extract=soup.find(id="resultStats")
    print(phrase_extract)

pyGoogleSearch('world')

Rose A
  • Instead of scraping, you should consider using their [API](https://developers.google.com/knowledge-graph/) – Gahan Nov 06 '18 at 18:01
  • That is not free over a certain amount. But do you know why the API result is different from this method's? – Rose A Nov 08 '18 at 15:01
  • An API is a more promising way than scraping: the site owner isn't obliged to inform you about changes, so your code might stop working at some point, whereas the API is well developed and maintained and the response time is much quicker compared to scraping. – Gahan Nov 08 '18 at 18:50
  • @Gahan so that is the reason that when I scrape with Beautiful Soup I get different results compared to searching in Google and also to the API? It means I get three different results by these three different methods – Rose A Nov 08 '18 at 23:38
  • Surely, because you scrape the data by a tag's id or class, and they might change it or nest it inside another tag; it's just HTML structure, whereas APIs have documentation. – Gahan Nov 09 '18 at 05:10

3 Answers


You're actually using the wrong URL to query Google's search engine. Everything after the # is a URL fragment, which is never sent to the server and is only handled client-side by JavaScript, so requests just fetches the Google homepage, which has no resultStats element. You should be using http://www.google.com/search?q=<query>.

So it'd look like this:

def pyGoogleSearch(word):
    address = 'http://www.google.com/search?q='
    newword = address + word
    page = requests.get(newword)
    soup = BeautifulSoup(page.content, 'html.parser')
    phrase_extract = soup.find(id="resultStats")
    print(phrase_extract)
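
Called the same way as in the question, it should now print the element itself rather than None:

pyGoogleSearch('world')
# e.g. <div id="resultStats">About 515,000,000 results</div>
# (the exact count varies by run and region)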

You also probably just want the text of that element, not the element itself, so you can do something like

phrase_text = phrase_extract.text

or to get the actual value as an integer:

val = int(phrase_extract.text.split(' ')[1].replace(',',''))
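
If Google drops the leading "About" for some queries, splitting on spaces grabs the wrong token. A regex that pulls out the comma-grouped number directly is a bit more defensive; here is a minimal sketch (the helper name extract_result_count is illustrative, not part of the original answer):

import re

def extract_result_count(stats_text):
    # Match the comma-grouped number in e.g. "About 12,010,000,000 results"
    # or "12,010,000,000 results"; return None if the wording changed.
    match = re.search(r'(\d[\d,]*)\s+results', stats_text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))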
wpercy
  • Thank you! It works! But there are two problems. First, when I print phrase_extract it shows me "About 515,000,000 results" while the class is not "sd". The second problem is that the result 515,000,000 is different from the number I see when I search in Google. – Rose A Nov 06 '18 at 18:18
  • I edited my question with the entire image of the XML code below the window. The picture appears as "entire image description here" in my question. – Rose A Nov 06 '18 at 18:27
  • Does anybody have any idea how to extract the exact number 515,000,000 from the text? I used text.split but it gives me this error: ValueError: invalid literal for int() with base 10: '3,170,000,000' – Rose A Nov 06 '18 at 20:40
  • @RoseA I've added a snippet for grabbing the integer value – wpercy Nov 06 '18 at 21:35
  • Thank you, and if you know the answers to my other questions I would be grateful if you could address them – Rose A Nov 08 '18 at 23:41

You could also inspect what the <div> above the one you're targeting contains; sometimes the count shows up there instead.

Also, make sure you're sending a user-agent header, since otherwise Google could treat your script as a tablet (or some other) user-agent and serve a page with different .class and #id tags, and so on. This could be the reason why your output is empty [].

Here's the code (and a replit.com demo) to get the number of search results:

from lxml import html
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=beautiful+cookies',
                        headers=headers,
                        stream=True)

response.raw.decode_content = True

tree = html.parse(response.raw)

# lxml is used to select element by XPath
# Requests + lxml: https://stackoverflow.com/a/11466033/1291371
# note: you can achieve it easily with bs4 as well by grabbing "#result-stats" id selector.
result = tree.xpath('//*[@id="result-stats"]/text()')[0]

print(result)

# About 3,890,000,000 results
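
As the comment in the snippet notes, the same element can be grabbed with BeautifulSoup instead of lxml. A minimal sketch using the "#result-stats" CSS selector (same headers as above; select_one() returns the first CSS match or None):

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=beautiful+cookies',
                        headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# select_one() takes a CSS selector; guard against the id being absent
stats = soup.select_one('#result-stats')
if stats is not None:
    print(stats.text)

# About 3,890,000,000 results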

Alternatively, you can use the Google Search Engine Results API from SerpApi to achieve the same thing in an easier fashion.

Part of the JSON output:

"search_information": {
 "organic_results_state":"Results for exact spelling",
 "total_results": 3890000000,
 "time_taken_displayed": 0.65,
 "query_displayed": "beautiful cookies"
}

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "beautiful cookies",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 4210000000

Disclaimer: I work for SerpApi.

Dmitriy Zub

If you don't mind using just the command line, try filtering with htmlq:

user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
term="something"

curl --silent  -A "$user_agent" "https://www.google.com/search?hl=en&q=$term" | htmlq "#result-stats" | grep -o "About.*results" | grep -o '[0-9]' | tr -d "\n"

If you get a 403 error, you could try other user agents to avoid it.

There are better ways to do this (probably with awk or sed) instead of grep and tr.

Pablo Bianchi