2

I'm trying to write a small program where you input a search query, it opens your browser with the result, and then scrapes the Google search result and prints it. I don't know how I would go about doing the scraping part. This is all I have so far:

import webbrowser 
query = input("What would you like to search: ")
for word in query:
    query = query + "+"
webbrowser.open("https://www.google.com/search?q="+query)

Let's say they type: "Who is donald trump?" Their browser will open and show the Donald Trump search results.

How would I go about scraping the summary provided by Wikipedia and then have it printed back to the user? Or, in any case, how do I scrape any data from a website?

uberdr3eam
  • Are you talking about scraping the data from Wikipedia.com or scraping the little snippet Google gives you *provided* by Wikipedia? – Mangohero1 Aug 07 '17 at 20:42
  • the snippet would be preferred, as it provides a basic summary and that's all I need. – uberdr3eam Aug 07 '17 at 20:44
  • I don't think that for loop does what you think it does. Try `query = query.replace(" ","+")`. – cdo256 Aug 07 '17 at 20:44
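Putting the comment's suggestion together with the original code, a fixed version of the query-building step might look like the sketch below (`build_search_url` is just an illustrative helper name, not part of any library):

```python
import webbrowser

def build_search_url(query):
    # turn the spaces into the '+' separators Google expects; the
    # original for loop appended one '+' per character instead
    return "https://www.google.com/search?q=" + query.replace(" ", "+")

url = build_search_url("Who is donald trump?")
print(url)  # https://www.google.com/search?q=Who+is+donald+trump?
# webbrowser.open(url)  # uncomment to open the result in the default browser
```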

4 Answers

2

To scrape just the summary, you can use the select_one() method provided by bs4 with a CSS selector. You can use the SelectorGadget Chrome extension, or any similar tool, to pick a selector quickly.

Make sure you're using a user-agent; otherwise Google could block your request, because the default user-agent will be python-requests (if you're using the requests library). There are lists of user-agents you can use to fake a real user visit.

From there you can scrape any other part you want with the select_one() method. Keep in mind that you can only scrape info from the knowledge graph if Google provides it, so you can add an if or try-except statement to handle the case where it's missing.

Code and full example:

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# let requests URL-encode the query instead of embedding spaces in the URL
params = {"q": "who is donald trump"}

html = requests.get("https://www.google.com/search", headers=headers, params=params).text

soup = BeautifulSoup(html, "lxml")

summary = soup.select_one(".Uo8X3b+ span").text
print(summary)

Output:

Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.
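If Google doesn't show a snippet for a query, select_one() returns None and reading .text raises an AttributeError. A minimal sketch of the defensive lookup mentioned above, run against an invented inline HTML fragment instead of a live request so it's self-contained:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that requests would return; the real page comes
# from requests.get(...) as in the answer above.
html = '<div class="Uo8X3b"></div><span>Some summary text</span>'

soup = BeautifulSoup(html, 'html.parser')

# select_one() returns None when nothing matches the selector,
# so guard before reading .text
node = soup.select_one('.Uo8X3b + span')
summary = node.text if node is not None else "No snippet found"
print(summary)  # Some summary text
```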

An alternative way to do it is with the Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the playground to see if it suits your needs.

Example code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "who is donald trump",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

summary = results["knowledge_graph"]['description']
print(summary)

Output:

Donald John Trump is an American media personality and businessman who served as the 45th president of the United States from 2017 to 2021.
Born and raised in Queens, New York City, Trump attended Fordham University and the University of Pennsylvania, graduating with a bachelor's degree in 1968.

Disclaimer: I work for SerpApi.

Dmitriy Zub
1

Although there are quite a few ways you can scrape data, I've demonstrated this using a library called BeautifulSoup. I believe it's a much more flexible option than using webbrowser to scrape data. Don't worry if this seems new to you; I'll walk you through the steps.


You'll need the BeautifulSoup and requests modules. If you don't have them, install them with pip.

Import the modules:

import requests
from bs4 import BeautifulSoup

Get the user input and save it to a variable:

query = input("What would you like to search: ")
query = query.replace(" ","+")
query = "https://www.google.com/search?q=" + query

Use the requests module to send a GET request to the host:

r = requests.get(query)
html_doc = r.text

Instantiate a BeautifulSoup object:

soup = BeautifulSoup(html_doc, 'html.parser')

Finally scrape the desired text:

for s in soup.find_all(id="rhs_block"):
    print(s.text)

Notice the ID. This ID is the container where Google puts all the snippet text. It will literally spit out all the text it finds inside this container, but you can, of course, format it to look a little neater.
By the way, if you happen to run into a UnicodeEncodeError, you'll have to append .encode('utf-8') to the end of each text property.
Let me know if you have any more questions. Cheers!
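To see what find_all(id=...) gives you without hitting Google, here is a self-contained sketch against a made-up fragment (the HTML is invented for illustration; the real rhs_block markup is far larger):

```python
from bs4 import BeautifulSoup

# Invented stand-in for Google's result page: the real rhs_block
# container is far larger, but the extraction works the same way.
html_doc = """
<div id="rhs_block">
  <h2>Donald Trump</h2>
  <p>45th president of the United States</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for s in soup.find_all(id="rhs_block"):
    # get_text() with a separator flattens all nested text into one line
    print(s.get_text(" ", strip=True))
```

Plain `s.text` works too; it just keeps the container's original newlines.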

Mangohero1
  • Just for the formal record: I was having a hard time getting requests to work. HTML scraped with requests didn't include the rhs_block id (or any useful id, for that matter). The answer by user Naazneen Jatu led me towards Selenium, but the response itself was not very useful! Here is a link to a great "tutorial" on how Selenium works: https://stackoverflow.com/questions/45259232/scraping-google-finance-beautifulsoup/45259523#45259523 I'll warn everyone seeing this... only use Selenium if requests is not working for you! Selenium is significantly more complex than requests. – Conrad Selig Jun 27 '19 at 08:50
  • This solution doesn't work now. The answer by Dmitriy Zub (second from the bottom of this page) works for now. Please make it the accepted answer to this question. – shakhyar.codes May 13 '21 at 04:52
0

I used the Selenium web driver and extracted the Google results snippets successfully.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome('path/to/chromedriver')  # specify the path of your chromedriver
browser.get('http://google.co.in/')
sbar = browser.find_element_by_id('lst-ib')
sbar.send_keys(x)  # x is the query
sbar.send_keys(Keys.ENTER)

# Elements on Google's search page have different classes and ids,
# so we have to try several selectors to get an answer.
elem = None
try:
    elem = browser.find_element_by_css_selector('div.MUxGbd.t51gnb.lyLwlc.lEBKkf')
except:
    pass
try:
    elem = browser.find_element_by_css_selector('span.ILfuVd.yZ8quc')
except:
    pass
try:
    elem = browser.find_element_by_css_selector('div.Z0LcW')
except:
    pass
if elem is not None:
    print(elem.text)

I hope it helps. If you find errors, please let me know! P.S. Take care of the indentation.

Note: you should have driver for the browser you will be using.
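The try/except chain above can also be written as a loop over candidate selectors. The same idea works with BeautifulSoup, shown here on an inline fragment so it runs without a browser (`first_match` is an illustrative helper, not a Selenium or bs4 API; it keeps the first match, so candidate order matters):

```python
from bs4 import BeautifulSoup

def first_match(soup, selectors):
    # Return the first element matched by any candidate selector;
    # candidates earlier in the list take priority.
    for sel in selectors:
        elem = soup.select_one(sel)
        if elem is not None:
            return elem
    return None

# Inline fragment standing in for a live results page.
html = '<div class="Z0LcW">Donald Trump</div>'
soup = BeautifulSoup(html, 'html.parser')

elem = first_match(soup, [
    'div.MUxGbd.t51gnb.lyLwlc.lEBKkf',
    'span.ILfuVd.yZ8quc',
    'div.Z0LcW',
])
print(elem.text if elem else "no snippet found")  # Donald Trump
```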

Naazneen Jatu
-1

The above code works well except for the ID: with id="rhs_block" I don't get any results. Instead I used id="res". Maybe that was updated recently.

Mayuri K