
I am trying to parse the first page of Google search results, specifically the Title and the small Summary that is provided. Here is what I have so far:

from urllib.request import urlretrieve
import urllib.parse
from urllib.parse import urlencode, urlparse, parse_qs
import webbrowser
from bs4 import BeautifulSoup
import requests

address = 'https://google.com/#q='
# Default Google search address start
file = open("OCR.txt", "rt")
# Open text document that contains the question
word = file.read()
file.close()

newString = ' '.join(word.split('\n'))
# The question is on multiple lines so this joins them together with proper spacing

print(newString)

qstr = urllib.parse.quote_plus(newString)
# Encode the string

newWord = address + qstr
# Combine the base and the encoded query

print(newWord)

source = requests.get(newWord)

soup = BeautifulSoup(source.text, 'lxml')

The part I am stuck on now is walking down the HTML tree to parse the specific data that I want. Everything I have tried so far has either thrown an AttributeError or just given back [].
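
For example, a typical attempt looks like this (the class name is just something I copied from the browser inspector, so it may not match what requests actually downloads):

result = soup.find('div', class_='r')
print(result.text)          # AttributeError: 'NoneType' object has no attribute 'text'
print(soup.find_all('h3'))  # or it just prints []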

I am new to Python and BeautifulSoup, so I am not sure of the syntax for getting to where I want. I have found that these are the individual search results on the page:

https://ibb.co/jfRakR

Any help on what to add to parse the Title and Summary of each search result would be MASSIVELY appreciated.

Thank you!

DevinGP
  • You might be struggling with this because Google renders a lot of its page with JavaScript, so the markup you see in the browser is not present in the downloaded data. Have you checked that the markup you want is actually in the data? You may want to consider using the Google Custom Search API https://developers.google.com/custom-search/json-api/v1/overview since I hear that Google routinely changes the markup on its search results pages. – Phil Dec 21 '17 at 18:07
  • Google uses JavaScript to put data on the page, and BeautifulSoup doesn't run JavaScript. If you turn off JavaScript in your browser and load the Google page, you'll see that it sends a page with the data, but in different tags. – furas Dec 21 '17 at 21:38

2 Answers


Your URL doesn't work for me, but with https://google.com/search?q= I get results.

import urllib.parse
from bs4 import BeautifulSoup
import requests
import webbrowser

text = 'hello world'
text = urllib.parse.quote_plus(text)  # URL-encode the query

url = 'https://google.com/search?q=' + text

response = requests.get(url)

# Uncomment to inspect the downloaded HTML in a browser:
#with open('output.html', 'wb') as f:
#    f.write(response.content)
#webbrowser.open('output.html')

soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):  # every search result sits in an element with class "g"
    print(g.text)
    print('-----')
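
If you want the title and the snippet separately, a sketch along these lines may work (the <h3> tag and the 'st' class are assumptions based on the markup at the time; Google changes these often):

for g in soup.find_all(class_='g'):
    title = g.find('h3')           # result titles are usually rendered as <h3>
    summary = g.find(class_='st')  # 'st' held the snippet text when I checked - an assumption
    if title and summary:
        print(title.text)
        print(summary.text)
        print('-----')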

Read the Beautiful Soup documentation.

furas
  1. The default Google search address doesn't contain a # symbol. Instead, it should use the /search pathname and a ? query string:
---> wrong: https://google.com/#q=
---> right: https://www.google.com/search?q=cake
  2. Make sure you're passing a user-agent into the HTTP request headers. The default requests user-agent is python-requests, so sites can identify that the request comes from a bot and block it; you then receive different HTML (some sort of error page) with different elements/selectors, which is why you were getting an empty result.

Check what your user-agent is, and see a list of user-agents for mobile, tablets, etc.
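
As a quick local check, requests exposes the headers it sends by default, so you can see the user-agent that gives the bot away:

import requests

# Prints something like 'python-requests/2.28.1' - the value sites key on to detect bots
print(requests.utils.default_headers()['User-Agent'])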

Pass user-agent in request headers:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
requests.get('YOUR_URL', headers=headers)

Full code example:

from bs4 import BeautifulSoup
import requests, json, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
  'q': 'tesla',  # query 
  'gl': 'us',    # country to search from
  'hl': 'en'     # language
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#timeouts
html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
soup = BeautifulSoup(html.text, 'lxml')

data = []

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']

  # sometimes there's no description and we need to handle this exception
  try:
    snippet = result.select_one('#rso .lyLwlc').text
  except AttributeError:
    snippet = None

  data.append({
    'title': title,
    'link': link,
    'snippet': snippet
  })

print(json.dumps(data, indent=2, ensure_ascii=False))

-------------
'''
[
  {
    "title": "Tesla: Electric Cars, Solar & Clean Energy",
    "link": "https://www.tesla.com/",
    "snippet": "Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ..."
  },
  {
    "title": "Tesla, Inc. - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Tesla,_Inc.",
    "snippet": "Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ..."
  },
  {
    "title": "Nikola Tesla - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Nikola_Tesla",
    "snippet": "Nikola Tesla was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the ..."
  }
]
'''
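
Note that generated class names like .tF2Cxc and .DKV0Md rotate frequently. If the selectors stop matching, a more defensive sketch is to anchor on the longer-lived .g container and plain tags (still an assumption; verify against the live HTML):

# Fallback keyed on the 'g' result container instead of generated class names
for result in soup.select('div.g'):
  title_tag = result.find('h3')
  link_tag = result.find('a', href=True)
  if title_tag and link_tag:
    print(title_tag.text, link_tag['href'])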

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan for testing.

The difference in your case is that you don't have to figure out why the output is empty and what causes it, bypass blocks from Google or other search engines, or maintain the parser over time.

Instead, you only need to grab the data from the structured JSON you want.

Example code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",                  # serpapi parsing engine
  "q": "tesla",                        # search query
  "hl": "en",                          # language of the search
  "gl": "us",                          # country from where search initiated
  "api_key": os.getenv("API_KEY")      # your serpapi API key
}
 
search = GoogleSearch(params)          # data extraction on the SerpApi backend
results = search.get_dict()            # JSON -> Python dict

for result in results["organic_results"]:
  print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")

----------
'''
Title: Tesla: Electric Cars, Solar & Clean Energy
Summary: Tesla is accelerating the world's transition to sustainable energy with electric cars, solar and integrated renewable energy solutions for homes and ...
Link: https://www.tesla.com/

Title: Tesla, Inc. - Wikipedia
Summary: Tesla, Inc. is an American electric vehicle and clean energy company based in Palo Alto, California, United States. Tesla designs and manufactures electric ...
Link: https://en.wikipedia.org/wiki/Tesla,_Inc.
'''

Disclaimer: I work for SerpApi.

Dmitriy Zub
  • Doesn't SerpApi only allow 100 searches a month under the free plan? https://serpapi.com/pricing – gnoodle Dec 25 '21 at 22:24
  • @gnoodle If you would like to have more than 100 requests, contact SerpApi directly (https://serpapi.com/#contact). In your opinion, what would make a good free plan? – Dmitriy Zub Dec 27 '21 at 14:37
  • I'm not criticising that SerpApi only offers 100 requests; I'm simply pointing it out as a fact of relevance to most people reading your answer. I see you've posted answers similar to this one in response to many similar questions; I would recommend clarifying this limitation in all of them. – gnoodle Dec 28 '21 at 16:14
  • But in answer to your question: if the intention is to allow a free demo of your API, I guess 100-200 requests would suffice so people can get the idea, but doing anything meaningful would usually require at least 1000 requests. – gnoodle Dec 28 '21 at 16:16
  • @gnoodle Appreciate your feedback and thank you for your thoughts. – Dmitriy Zub Jan 25 '22 at 11:39