
This is what I've done so far:

import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=programming"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

table = soup.find('div', attrs={'id': 'result-stats'})

print(table)

I want to get the number of results as an integer, e.g. 1350000000.

Daniel

3 Answers


You are missing the User-Agent header, a string that tells the server what kind of device you are accessing the page with.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
URL = "https://www.google.com/search?q=programming"
result = requests.get(URL, headers=headers)

soup = BeautifulSoup(result.content, 'html.parser')

# the direct text of the div, e.g. 'About 1,410,000,000 results'
total_results_text = soup.find("div", {"id": "result-stats"}).find(text=True, recursive=False)
# drop every character that is not a digit, then convert to an integer
results_num = int(''.join(num for num in total_results_text if num.isdigit()))
print(results_num)
Ahmed Soliman

This code will do the trick:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
result = requests.get("https://www.google.com/search?q=programming", headers=headers)

src = result.content
soup = BeautifulSoup(src, 'lxml')

print(soup.find("div", {"id": "result-stats"}))
HSB
  • Adding headers doesn't prevent being detected as a bot. – Ahmed Soliman Apr 06 '20 at 17:30
  • @AhmedSoliman What is it for then? I will edit my answer accordingly. – HSB Apr 06 '20 at 17:37
  • User-Agent is a string that tells the server what kind of device you are accessing the page with. If you overload the server with too many requests, you will be blocked despite sending a User-Agent header. – Ahmed Soliman Apr 06 '20 at 17:42
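As the last comment notes, a User-Agent alone won't save you from being blocked if you send too many requests. One common mitigation is to space retries out with a capped exponential backoff — a minimal sketch (the delay values are illustrative assumptions, not anything Google documents):

```python
def backoff_delays(base=1.0, factor=2.0, retries=5, cap=30.0):
    """Capped exponential delays (in seconds) to sleep between successive requests."""
    return [min(base * factor ** attempt, cap) for attempt in range(retries)]

# sleep time.sleep(delay) between attempts in real scraping code
print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```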

If you need to extract just one element, use the bs4 select_one() method. It's a bit more readable and a bit faster than find(). See the CSS selectors reference.
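For the markup in question the two calls are interchangeable — a small comparison on an inline fragment (the fragment is illustrative; the live page markup may differ):

```python
from bs4 import BeautifulSoup

html = '<div id="result-stats">About 107,000 results</div>'
soup = BeautifulSoup(html, 'html.parser')

via_find = soup.find("div", {"id": "result-stats"})  # attribute-dict lookup
via_select = soup.select_one('#result-stats')        # CSS id selector

print(via_find is via_select)  # True: both locate the same element in the tree
print(via_select.text)         # About 107,000 results
```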

If you need to extract data very fast, try selectolax, a wrapper around the lexbor HTML renderer library, which is written in pure C with no dependencies.

Code and example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah definition",  # query
  "gl": "us",                    # country 
  "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params)
soup = BeautifulSoup(response.text, 'lxml')

# .previous_sibling will go to, well, previous sibling removing unwanted part: "(0.38 seconds)"
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 107,000 results
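To finish the conversion the question asks for, strip the thousands separators out of that string and cast to int — a short pure-Python sketch (the input string is the example output above):

```python
import re

number_of_results = "About 107,000 results"          # output from the snippet above
match = re.search(r'[\d,]+', number_of_results)      # grab the digits-and-commas run
total = int(match.group().replace(',', ''))          # '107,000' -> 107000
print(total)  # 107000
```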

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to pull the data you want out of structured JSON, rather than figure out how to extract certain elements or how to bypass blocks from Google.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah defenition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 107000

P.S. I wrote a blog post about how to scrape Google Organic Results.

Disclaimer, I work for SerpApi.

Dmitriy Zub