I'm currently working on web scraping and need to extract the description of a city from a Google search result.

Say I want a description of the city of Madrid. I searched for it and got the following result:

I need to extract the highlighted text (the description paragraph).

This is the source code for the target div:

<div jscontroller="GCSbhd" class="kno-rdesc" jsaction="seM7Qe:c0XUbe;Iigoee:c0XUbe;rcuQ6b:npT2md">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Située au centre de l'Espagne, Madrid, sa capitale, est une ville dotée d'élégants boulevards et de vastes parcs très bien entretenus comme le Retiro. Elle est réputée pour ses riches collections d'œuvres d'art européennes, avec notamment celles du musée du Prado, réalisées par Goya, Velázquez et d'autres maîtres espagnols. Au cœur de la vieille Madrid des Habsbourgs se trouve la Plaza&nbsp;Mayor, bordée de portiques, et, à proximité, le Palais royal baroque et son Armurerie, qui comporte des armes historiques.
        <span>
            <span class="eHaQD"> ―&nbsp;Google
            </span>
        </span>
    </span>
</div>

I tried scraping the page and selecting the <h3> tag, intending to then select its sibling, but the result is None. This is the code I used:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('h3', {'class': 'Uo8X3b OhScic zsYMMe'})
print(target_div_PresMadrid)

I also tried selecting the only parent <div> whose class doesn't change, but that returns None as well. This is the code for it:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('div', {'class': 'liYKde g VjDLd'})
print(target_div_PresMadrid)

Can anyone help me understand the mechanics of the search engine so that I can extract that paragraph?

2 Answers

If you disable JavaScript in your browser, you'll see that the paragraph you want is actually under the class BNeawe s3v9rd AP7Wnd:

<div class="BNeawe s3v9rd AP7Wnd">
 Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
</div>

The requests library doesn't execute JavaScript, so the HTML it receives is this no-JavaScript version of the page. You therefore need to target the class BNeawe s3v9rd AP7Wnd instead.

Although several elements share that class, find() returns only the first match, which here is the description, so you can use it directly:

import requests
from bs4 import BeautifulSoup


url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, "html.parser")
target_div_PresMadrid = soup_PresMadrid.find("div", {"class": "BNeawe s3v9rd AP7Wnd"})
print(target_div_PresMadrid.text)

Output:

Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
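
Since find() returns None when nothing matches, calling .text on the result can raise an AttributeError if Google serves a different layout. The following is a minimal offline sketch of a guarded lookup, assuming the description is still the first element with that class. SAMPLE_HTML and the helper names first_text and all_texts are hypothetical and only stand in for the real response:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the no-JavaScript results page.
SAMPLE_HTML = """
<div class="BNeawe s3v9rd AP7Wnd">Madrid, Spain's central capital, is a city of elegant boulevards.</div>
<div class="BNeawe s3v9rd AP7Wnd">Some other snippet</div>
"""

CSS_CLASS = "BNeawe s3v9rd AP7Wnd"


def first_text(html, css_class=CSS_CLASS):
    """Return the text of the first matching <div>, or None if there is no match."""
    soup = BeautifulSoup(html, "html.parser")
    match = soup.find("div", {"class": css_class})
    return match.get_text(strip=True) if match is not None else None


def all_texts(html, css_class=CSS_CLASS):
    """Return the text of every matching <div>, useful for inspecting the candidates."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True) for div in soup.find_all("div", {"class": css_class})]


print(first_text(SAMPLE_HTML))
print(len(all_texts(SAMPLE_HTML)))
```

Note that passing a multi-class string like this matches the class attribute as an exact string, so the order of the class names matters. Listing all candidates with all_texts is a quick way to check whether the first match is really the paragraph you want.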

MendelG
    Thanks a lot for your answer, it worked fine and made me search a bit more into the problem. I've found that Dryscrape could be more useful, but since it wouldn't install properly on the machine I found a solution by using the headers and parameters for the `requests` library. – Med. Amine Aljane Jun 29 '21 at 16:49

You're looking for this:

soup.select_one('.zsYMMe+ span') # css selector for knowledge graph description

Try the SelectorGadget Chrome extension to grab CSS selectors, and see a CSS selectors reference for the syntax.

Make sure you send a user-agent in the request headers to decrease the number of blocked requests. You can look up your own browser's value on a "What is my user-agent?" page.

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests

# A browser-like User-Agent; note the trailing space so the two string parts don't run together
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    'q': 'Madrid',  # search query
    'hl': 'en',     # interface language
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')  # the 'lxml' parser requires the lxml package to be installed

# not every knowledge graph has a snippet (description), so select_one() may return None
try:
    snippet = soup.select_one('.zsYMMe+ span').text
except AttributeError:
    snippet = None
print(snippet)

----
'''
Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry. ― Google
'''
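
The output above still ends with the Google attribution, because in the markup from the question that attribution sits in a nested <span> inside the description <span>. Here is a rough offline sketch of dropping it, assuming the markup keeps that shape. SAMPLE_HTML is a shortened copy of the question's snippet and description_without_attribution is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup shown in the question (an assumption about the live page).
SAMPLE_HTML = """
<div class="kno-rdesc">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Madrid, Spain's central capital, is a city of elegant boulevards.
        <span>
            <span class="eHaQD"> ― Google
            </span>
        </span>
    </span>
</div>
"""


def description_without_attribution(html):
    """Return the description text without the nested attribution, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # '+' is the adjacent sibling combinator: the <span> right after the .zsYMMe heading
    node = soup.select_one(".zsYMMe + span")
    if node is None:
        return None
    # Remove the first nested <span>, which wraps the attribution
    nested = node.find("span")
    if nested is not None:
        nested.decompose()
    return node.get_text(strip=True)


print(description_without_attribution(SAMPLE_HTML))
```

Removing the nested element before reading the text avoids string slicing that would depend on the exact wording of the attribution.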

Alternatively, you can use the Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "Madrid",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

snippet = results['knowledge_graph']['description']
print(snippet)

-------
'''
Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry. 
'''
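
As noted earlier, not every knowledge graph has a description, so indexing results['knowledge_graph']['description'] directly can raise a KeyError. A small sketch of a tolerant lookup on the returned dict follows. extract_description is a hypothetical helper, and the sample dicts are made up for illustration and only mirror the two keys used above:

```python
def extract_description(results):
    """Return the knowledge-graph description from a results dict, or None if it is missing."""
    # .get() with an empty-dict default avoids a KeyError when 'knowledge_graph' is absent
    return results.get("knowledge_graph", {}).get("description")


# Made-up sample data for illustration only
with_description = {"knowledge_graph": {"description": "Madrid, Spain's central capital."}}
without_graph = {"organic_results": []}

print(extract_description(with_description))
print(extract_description(without_graph))
```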

Disclaimer: I work for SerpApi.

Dmitriy Zub