Answer from Andrej Kesely will throw an error since this css
class no longer exists:
gotolink = parser.find('div', class_='r').a["href"]
AttributeError: 'NoneType' object has no attribute 'a'
Learn more about user-agent
and request headers.
Basically user-agent
let identifies the browser, its version number, and its host operating system that representing a person (browser) in a Web context that lets servers and network peers identify if it's a bot or not.
In this case, you need to send a fake user-agent
so Google would treat your request as a "real" user visit, also known as user-agent
spoofing.
Pass user-agent
in request headers
:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(YOUR_URL, headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "selena gomez"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
link = result.select_one('.yuRUbf a')['href']
print(link)
# https://www.instagram.com/selenagomez/
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
Essentially, the main difference in your case is that you don't need to think about how to bypass Google blocks if they appear or figure out how to scrape elements that are a bit harder to scrape since it's already done for the end-user. The only thing that needs to be done is just get the data you want from the JSON string.
Example code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "selena gomez",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] means index of the first organic result
link = results['organic_results'][0]['link']
print(link)
# https://www.instagram.com/selenagomez/
Disclaimer, I work for SerpApi.