
I am writing a web scraper to extract the number of results of a Google search, which appears at the top left of the results page. I have written the code below, but I do not understand why phrase_extract is None. I want to extract the phrase "About 12,010,000,000 results". Which part am I getting wrong? Maybe I am parsing the HTML incorrectly?

import requests
from bs4 import BeautifulSoup

def pyGoogleSearch(word):   
    address='http://www.google.com/#q='
    newword=address+word
    #webbrowser.open(newword)
    page=requests.get(newword)
    soup = BeautifulSoup(page.content, 'html.parser')
    phrase_extract=soup.find(id="resultStats")
    print(phrase_extract)

pyGoogleSearch('world')

Rose A
  • Instead of scraping, you should consider using their [API](https://developers.google.com/knowledge-graph/) – Gahan Nov 06 '18 at 18:01
  • That is not free over a certain amount. But do you know why the API result is different from this method's? – Rose A Nov 08 '18 at 15:01
  • An API is a more promising way than scraping: the site owner isn't obliged to inform you about changes, so your code might stop working at some point, whereas the API is well developed and maintained and the response time is much quicker compared to scraping. – Gahan Nov 08 '18 at 18:50
  • @Gahan so that is the reason that when I scrape with Beautiful Soup I get different results compared to searching in Google and also to the API? It means I get three different results by these three different methods – Rose A Nov 08 '18 at 23:38
  • Surely, because you scrape the data by a tag's id or class, and they might change it or nest it inside another tag; it's just HTML structure, whereas APIs have documentation. – Gahan Nov 09 '18 at 05:10

3 Answers


You're actually using the wrong URL to query Google's search engine. Everything after the # is a URL fragment, which is never sent to the server and is only handled client-side by JavaScript, so requests just fetches the Google homepage, which has no resultStats element. You should be using http://www.google.com/search?q=<query>.

So it'd look like this:

def pyGoogleSearch(word):
    address = 'http://www.google.com/search?q='
    newword = address + word
    page = requests.get(newword)
    soup = BeautifulSoup(page.content, 'html.parser')
    phrase_extract = soup.find(id="resultStats")
    print(phrase_extract)
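
Called the same way as in the question, it should now print the element itself rather than None:

pyGoogleSearch('world')
# e.g. <div id="resultStats">About 515,000,000 results</div>
# (the exact count varies by run and region)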

You also probably just want the text of that element, not the element itself, so you can do something like

phrase_text = phrase_extract.text

or to get the actual value as an integer:

val = int(phrase_extract.text.split(' ')[1].replace(',',''))
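
If Google drops the leading "About" for some queries, splitting on spaces grabs the wrong token. A regex that pulls out the comma-grouped number directly is a bit more defensive; here is a minimal sketch (the helper name extract_result_count is illustrative, not part of the original answer):

import re

def extract_result_count(stats_text):
    # Match the comma-grouped number in e.g. "About 12,010,000,000 results"
    # or "12,010,000,000 results"; return None if the wording changed.
    match = re.search(r'(\d[\d,]*)\s+results', stats_text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))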
wpercy
  • Thank you! It works! But there are two problems. First, when I print phrase_extract it shows me "About 515,000,000 results" while the class is not "sd". The second problem is that the result 515,000,000 is different from the number I see when I search in Google. – Rose A Nov 06 '18 at 18:18
  • I edited my question with the entire image of the XML code below the window. The picture appears as "entire image description here" in my question. – Rose A Nov 06 '18 at 18:27
  • Does anybody have any idea how to extract the exact number 515,000,000 from the text? I used text.split but it gives me this error: ValueError: invalid literal for int() with base 10: '3,170,000,000' – Rose A Nov 06 '18 at 20:40
  • @RoseA I've added a snippet for grabbing the integer value – wpercy Nov 06 '18 at 21:35
  • Thank you, and if you know the answers to my other questions I would be grateful if you could address them – Rose A Nov 08 '18 at 23:41

You could also inspect what the <div> above the one you're targeting contains; sometimes the count shows up there instead.

Also, make sure you're sending a user-agent header, since otherwise Google could treat your script as a tablet (or some other) user-agent and serve a page with different .class and #id tags, and so on. This could be the reason why your output is empty [].

Here's the code (and a replit.com demo) to get the number of search results:

from lxml import html
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=beautiful+cookies',
                        headers=headers,
                        stream=True)

response.raw.decode_content = True

tree = html.parse(response.raw)

# lxml is used to select element by XPath
# Requests + lxml: https://stackoverflow.com/a/11466033/1291371
# note: you can achieve it easily with bs4 as well by grabbing "#result-stats" id selector.
result = tree.xpath('//*[@id="result-stats"]/text()')[0]

print(result)

# About 3,890,000,000 results
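
As the comment in the snippet notes, the same element can be grabbed with BeautifulSoup instead of lxml. A minimal sketch using the "#result-stats" CSS selector (same headers as above; select_one() returns the first CSS match or None):

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=beautiful+cookies',
                        headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# select_one() takes a CSS selector; guard against the id being absent
stats = soup.select_one('#result-stats')
if stats is not None:
    print(stats.text)

# About 3,890,000,000 results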

Alternatively, you can use the Google Search Engine Results API from SerpApi to achieve the same thing in an easier fashion.

Part of the JSON output:

"search_information": {
 "organic_results_state":"Results for exact spelling",
 "total_results": 3890000000,
 "time_taken_displayed": 0.65,
 "query_displayed": "beautiful cookies"
}

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "beautiful cookies",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 4210000000

Disclaimer: I work for SerpApi.

Dmitriy Zub

If you don't mind using just the command line, try filtering with htmlq:

user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
term="something"

curl --silent  -A "$user_agent" "https://www.google.com/search?hl=en&q=$term" | htmlq "#result-stats" | grep -o "About.*results" | grep -o '[0-9]' | tr -d "\n"

If you get a 403 error, you could try other user agents to avoid it.

There are better ways to do this (probably with awk or sed) instead of grep and tr.

Pablo Bianchi