I made this code to search for all the top links in a Google search, but it's returning None.

import webbrowser, requests
from bs4 import BeautifulSoup
string = 'selena+gomez'
website = f'http://google.com/search?q={string}'
req_web = requests.get(website).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)

2 Answers

Google needs you to specify the User-Agent HTTP header to return the correct page. Without the correct User-Agent specified, Google returns a page that doesn't contain <div> tags with the r class. You can see the difference when you print the parsed HTML (print(parser)) with and without the User-Agent.

For example:

import requests
from bs4 import BeautifulSoup

string = 'selena+gomez'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
website = f'http://google.com/search?hl=en&q={string}'

req_web = requests.get(website, headers=headers).text
parser = BeautifulSoup(req_web, 'html.parser')
gotolink = parser.find('div', class_='r').a["href"]
print(gotolink)

Prints:

https://www.instagram.com/selenagomez/?hl=en
Andrej Kesely
  • This is very helpful (to me). Would you be able to add a sentence about how this solves the problem? – andrewJames Jun 04 '20 at 18:16
  • @andrewjames I added some explanation. It boils down to the fact that without `User-Agent` Google returns a different version of the HTML than the one you see in the browser. – Andrej Kesely Jun 04 '20 at 18:19
  • @AndrejKesely thank you very much bro!!!! That solved my problem... –  Jun 05 '20 at 17:06

The answer from Andrej Kesely will now throw an error, since this CSS class no longer exists:

gotolink = parser.find('div', class_='r').a["href"]
AttributeError: 'NoneType' object has no attribute 'a'
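
If you just want to fail gracefully instead of hitting that AttributeError, you can check whether find() matched anything before dereferencing it. A minimal sketch, assuming the same parser object from the original code (the selector itself is still the outdated one, only guarded):

result = parser.find('div', class_='r')
# find() returns None when no <div class="r"> exists, so check before
# touching .a["href"]
if result is not None and result.a is not None:
    print(result.a["href"])
else:
    print("No matching element - the markup has probably changed")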

Learn more about user-agent and request headers.

Basically, the user-agent identifies the browser, its version number, and its host operating system. It represents a person (browser) in a Web context and lets servers and network peers tell whether the request comes from a bot or not.

In this case, you need to send a fake user-agent so that Google treats your request as a visit from a "real" user, a technique also known as user-agent spoofing.

Pass user-agent in request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get(YOUR_URL, headers=headers)

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "selena gomez"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

link = soup.select_one('.yuRUbf a')['href']
print(link)

# https://www.instagram.com/selenagomez/
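
Since the question asks for all the top links rather than just the first one, you can also iterate over every match instead of using select_one(). A short sketch, assuming the same .yuRUbf container class and the soup object from the snippet above:

# each match is an <a> tag inside a result container; 'href' holds the URL
for link in soup.select('.yuRUbf a'):
    print(link['href'])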

Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.

Essentially, the main difference in your case is that you don't need to think about how to bypass Google's blocks if they appear, or figure out how to scrape elements that are harder to extract, since that is already done for the end user. The only thing that needs to be done is to pull the data you want out of the JSON response.

Example code:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "selena gomez",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] means index of the first organic result 
link = results['organic_results'][0]['link']
print(link)

# https://www.instagram.com/selenagomez/
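
Likewise, if you want every top link instead of only the first one, you can loop over the organic_results list. A minimal sketch reusing the same results dictionary:

# each entry in organic_results is a dict; 'link' holds the result URL
for result in results['organic_results']:
    print(result['link'])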

Disclaimer: I work for SerpApi.

Dmitriy Zub