
I am trying to learn how to use BS4, but I ran into this problem. I am trying to find the text on the Google Search results page that shows the number of results for the search, but I can't find the text 'results' either in html_page or in the soup HTML parser. This is the code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

print(b'results' in html_page)
print('results' in soup)

Both prints return False. What am I doing wrong, and how do I fix it?

EDIT:

Turns out the language of the webpage was a problem; adding &hl=en to the URL almost fixed it.

url = 'https://www.google.com/search?q=stack&hl=en'

The first print is now True but the second is still False.

Gustavo
    The first one works for me (and the second line would normally print `False`). Did you try `print`ing `html_page`? That will tell you. You are probably being served a captcha. – Selcuk Aug 15 '19 at 00:07
Google is not a great example for learning to parse HTML. They use AJAX extensively to build the page and have several anti-scraping methods in place. – Klaus D. Aug 15 '19 at 00:11
  • @Selcuk Yes I tried printing the page and it looked like HTML code – Gustavo Aug 15 '19 at 00:14
  • @KlausD. so scraping Google is a bad idea then? I wanted to build something to scrape Google specifically. – Gustavo Aug 15 '19 at 00:14
Good luck then. Be aware that they change their page, sometimes even multiple times a day, to make that as hard as possible. They want you to use their APIs (and throw in some coins). – Klaus D. Aug 15 '19 at 00:18
    @GustavoMaia It will always _look like_ HTML code. The question is if it is the expected HTML code. – Selcuk Aug 15 '19 at 00:23
  • What is your question then? This is normal behaviour. – Selcuk Aug 15 '19 at 00:24
  • How to make the second print return True? – Gustavo Aug 15 '19 at 00:24
  • https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal should help for your second question. – Axiumin_ Aug 15 '19 at 00:49
`soup` is not text, and checking `text in soup` may never give `True`. You could try `"results" in soup.strings`, but that only works if a string is exactly `results`, not when `results` appears inside a longer text. – furas Aug 15 '19 at 00:57
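To make furas's point concrete, here is a minimal sketch with a made-up HTML snippet (not the actual Google page): membership (`in`) on a soup object tests against its child nodes, while `soup.get_text()` returns a plain string you can search.

```python
from bs4 import BeautifulSoup

# A made-up page, standing in for the Google results page
html = '<html><body><p>About 114,000 results</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# 'in' on a soup/Tag checks its .contents (child nodes), not the page text
print('results' in soup)             # False

# get_text() flattens the document to a str, so substring search works
print('results' in soup.get_text())  # True
```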

2 Answers


The requests library returns raw bytes from response.content, while response.text gives the decoded string. So, to answer your second question, replace res.content with res.text.

from bs4 import BeautifulSoup
import requests

url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')

print('results' in soup)
Output: True
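The bytes/str distinction this relies on can be sketched without any network call (the sample string below is made up):

```python
# Simulating the two response attributes:
# res.content holds bytes, res.text holds a decoded str.
raw = 'About 114,000 results'.encode('utf-8')   # like res.content
decoded = raw.decode('utf-8')                    # like res.text

print(b'results' in raw)      # bytes pattern against bytes -> True
print('results' in decoded)   # str pattern against str -> True
# Note: 'results' in raw would raise a TypeError (mixing str and bytes)
```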

Keep in mind, Google is usually very active in blocking scrapers. To avoid getting blocked or served a captcha, you can add a user-agent header to emulate a browser:

# This is a standard user-agent of Chrome browser running on Windows 10 
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' } 

Example:

from bs4 import BeautifulSoup
import requests 
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text 
soup = BeautifulSoup(resp, 'html.parser') 
...
<your code here>

Additionally, you can send a fuller set of headers to look like a legitimate browser:

headers = { 
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip', 
'DNT' : '1', # Do Not Track Request Header 
'Connection' : 'close'
}
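A way to sanity-check what will actually be sent, without hitting Google at all, is to build a prepared request and inspect it (the query parameters are the ones from the question):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5',
}

# Build the request without sending it
session = requests.Session()
req = requests.Request('GET', 'https://www.google.com/search',
                       params={'q': 'stack', 'hl': 'en'}, headers=headers)
prepared = session.prepare_request(req)

print(prepared.url)                    # full URL including the query string
print(prepared.headers['User-Agent'])  # the browser UA we set
```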
0xInfection

It's not that res.content should be changed to res.text, as 0xInfection mentioned; the request would still return a result either way.

However, in some cases response.content will only be readable bytes when the response uses gzip or deflate transfer encodings, which requests decodes automatically (correct me in the comments or edit this answer if I'm wrong).

It's because there's no user-agent specified, so Google will eventually block the request: the default requests user-agent is python-requests, and Google understands that it's a bot/script. Learn more about request headers.
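You can check what requests identifies itself as by default:

```python
import requests

# The User-Agent that requests sends when you don't override it,
# e.g. 'python-requests/2.31.0' (version depends on your install)
print(requests.utils.default_user_agent())
```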

Pass user-agent into request headers:

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)

Code and example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah definition",  # query 
  "gl": "us",                    # country to make request from
  "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params).content
soup = BeautifulSoup(response, 'lxml')

number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 114,000 results
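The `#result-stats nobr` selector can be tried offline on a hand-made approximation of the result-stats markup (the snippet below is an assumption about the page structure, not a capture of it):

```python
from bs4 import BeautifulSoup

# Approximation of the stats bar: the count text, then timing inside <nobr>
html = ('<div id="result-stats">About 114,000 results'
        '<nobr> (0.45 seconds)</nobr></div>')
soup = BeautifulSoup(html, 'html.parser')

# previous_sibling of <nobr> is the text node just before it
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)  # About 114,000 results
```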

Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to extract the data you want, without figuring out how to parse the page or how to bypass blocks from Google or other search engines, since that's already done for the end user.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 112000

Disclaimer, I work for SerpApi.

Dmitriy Zub