0

I took the code below from the answer How to use BeautifulSoup to parse google search results in Python

It used to work on my Ubuntu 16.04 and I have both Python 2 and 3.

The code is below:

import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'My query goes here'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

#with open('output.html', 'wb') as f: 
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')

It executes but prints nothing. The problem is really suspicious to me. Any help would be appreciated.

2 Answers2

4

The problem is that Google is serving different HTML when you don't specify User-Agent in headers. To specify custom header, add dict with User-Agent to headers= parameter in requests:

import urllib
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'My query goes here'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'
}

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')

Prints:

How to Write the Perfect Query Letter - Query Letter Examplehttps://www.writersdigest.com/.../how-to-write-the-perfect-qu...PuhverdatudTõlgi see leht21. märts 2016 - A literary agent shares a real-life novel pitch that ultimately led to a book deal—and shows you how to query your own work with success.
-----
Inimesed küsivad ka järgmistHow do you start a query letter?What should be included in a query letter?How do you end a query in an email?How long is a query letter?Tagasiside
-----

...and so on.
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
0

Learn more about user-agent and request headers.

Basically, user-agent let identifies the browser, its version number, and its host operating system that representing a person (browser) in a Web context that lets servers and network peers identify if it's a bot or not.

Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.

To make it look better, you can pass URL params as a dict() which is more readable and requests do everything for you automatically (same goes for adding user-agent into headers):

params = {
  "q": "My query goes here"
}

requests.get("YOUR_URL", params=params)

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  print(title)

-------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''

Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to extract the data you want from JSON string rather than figuring out how to extract, maintain or bypass blocks from Google.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['title'])

--------
'''
MySQL 8.0 Reference Manual :: 3.2 Entering Queries
Google Sheets Query function: Learn the most powerful ...
Understanding MySQL Queries with Explain - Exoscale
An Introductory SQL Tutorial: How to Write Simple Queries
Writing Subqueries in SQL | Advanced SQL - Mode
Getting IO and time statistics for SQL Server queries
How to store MySQL query results in another Table? - Stack ...
More efficient SQL with query planning and optimization (article)
Here are my Data Files. Here are my Queries. Where ... - CIDR
Slow in the Application, Fast in SSMS? - Erland Sommarskog
'''

Disclaimer, I work for SerpApi.

Dmitriy Zub
  • 1,398
  • 8
  • 35