
So I want to find all the search results and store them in a list or something. Analysing the Google page tells me that all results are technically in the g class:

[Screenshot: Google Search page analysis]

So technically, extracting a URL (for example) from the search results page should be as easy as:

import urllib.parse
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)

And yet, I have no output. Why?

Edit: Even manually parsing the stored page doesn't help:

import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")

#soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
Jishan

4 Answers


The following approach should fetch you a few of the result links from the landing page. You may need to kick out some links ending with dots, i.e. truncated display URLs (there's a small filter sketch after the code). It's really a difficult job to grab links from Google search using requests alone.

import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    # Without a User-Agent header, Google serves a stripped-down page
    res = requests.get(url.format(query.replace(" ", "+")), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    # .BNeawe:nth-of-type(2) holds the displayed URL text, not the href
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ", "/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
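
If the truncated links get in the way, a minimal filter could look like the sketch below (the sample list is hypothetical; only the trailing-dots check matters):

# Hypothetical input: a mix of complete and truncated link strings
links = [
    "https://en.wikipedia.org/wiki/Computer_security",
    "https://www.kaspersky.com/resource-center/...",
]

# Keep only links that don't end with dots (i.e. not truncated by Google)
clean_links = [link for link in links if not link.endswith(("...", "…"))]
print(clean_links)  # ['https://en.wikipedia.org/wiki/Computer_security']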
SIM

You can always climb several elements up or down to test things out, using next_sibling/previous_sibling or next_element/previous_element. All results are in a <div> element with the .tF2Cxc class.
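
For instance, here's a quick sketch of that kind of tree navigation (the markup is a made-up fragment, only there to demonstrate the calls):

from bs4 import BeautifulSoup

# A made-up result fragment mimicking the .tF2Cxc / .yuRUbf structure
html = '<div class="tF2Cxc"><div class="yuRUbf"><a href="https://example.com">title</a></div><span>snippet</span></div>'
soup = BeautifulSoup(html, 'lxml')

anchor = soup.select_one('.yuRUbf a')
print(anchor['href'])                       # https://example.com
print(anchor.parent.next_sibling.text)      # snippet (the element next to .yuRUbf)
print(anchor.find_parent(class_='tF2Cxc'))  # climbs back up to the result container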

Scraping the URLs is then as easy as:

  1. make a for loop in combination with bs4's .select() method, which takes a CSS selector as input.
  2. grab the .yuRUbf container with the .select_one() method.
  3. access the <a> tag's href attribute.
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf').a['href']

Code and full example:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'cyber security'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf').a['href'] # or ('.yuRUbf a')['href']
  print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://digitalguardian.com/blog/what-cyber-security
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://en.wikipedia.org/wiki/Computer_security
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
https://staysafeonline.org/
'''

Alternatively, you can do the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "api_key": os.getenv("API_KEY"), # API key taken from the environment
  "engine": "google", # search engine
  "q": "cyber security", # query
  "hl": "en", # defining a language
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  link = result['link']
  print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://digitalguardian.com/blog/what-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://staysafeonline.org/
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
'''

Disclaimer: I work for SerpApi.

Dmitriy Zub

from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
dork = 'cyber security'
browser.get(f"https://www.google.com/search?q={dork}")
time.sleep(5)  # give the results page time to render
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')

# 'r' was the class Google wrapped each result in at the time of writing
for item in soup.find_all('div', attrs={'class': 'r'}):
    for href in item.find_all('a'):
        print(href.get('href'))

Actually, if you print response.content and check the output, you will find that there is no HTML tag with class g. It seems that these elements come from dynamic loading, and BeautifulSoup only parses the static content it is given. That is why looking for HTML tags with class g doesn't return any elements.
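
A quick way to check that claim yourself (a minimal sketch; it just looks for the class name in the raw response):

import requests

response = requests.get('https://google.com/search?q=cyber+security')

# If the full, browser-style page had been served, result containers with
# class="g" would show up in the raw HTML; with a bare requests call they don't.
print('class="g"' in response.text)   # expected: False
print(response.status_code, len(response.text))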

HNMN3

  • Yeah, the reason it doesn't show up in the output is that Google renders with `JavaScript` after page load, so the only way is to use `selenium` or `dryscrape` :) otherwise https://pypi.org/project/google-search-results-serpwow/ – αԋɱҽԃ αмєяιcαη Nov 23 '19 at 15:08
  • @Jishan check my answer :) – αԋɱҽԃ αмєяιcαη Nov 23 '19 at 15:51
  • @Jishan Buddy, you are making the same mistake again. **response.content** is not going to give you the complete HTML page you see in the browser. Try saving the page from the browser and then opening it in your code. It will work properly. – HNMN3 Nov 23 '19 at 16:21