
I'm trying to parse the g-inner-card elements (class="_KBh"), but for some reason the selector returns nothing:

linkElems = soup.select('._KBh a')

print(linkElems)

This prints an empty list, [].

import webbrowser, sys, pyperclip, requests, bs4

# Search term comes from the command line, or from the clipboard as a fallback
if len(sys.argv) > 1:
    term = ' '.join(sys.argv[1:])
else:
    term = pyperclip.paste()

res = requests.get("https://www.google.com/search?q=" + term)
try:
    res.raise_for_status()
except Exception as ex:
    print('There was a problem: %s' % ex, '\nSorry!!')

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('._KBh a')
print(linkElems)

# Open at most 3 of the matched result links in the browser
numOpen = min(3, len(linkElems))
for i in range(numOpen):
    print(linkElems[i].get('href'))
    webbrowser.open('https://google.com/' + linkElems[i].get('href'))

This code snippet tries to open at most 3 Google search results in separate browser windows when a command-line argument (i.e. the term to be searched) is entered. It specifically targets the results in Google's inner cards.

2 Answers


If you print res.text, you can see that you are not getting the complete/correct data from the page. This is happening because Google is blocking the Python script.

To overcome this, you can pass a User-Agent header to make the script look like a real browser.

Results with default User-Agent:

>>> URL = 'https://www.google.co.in/search?q=federer'
>>> res = requests.get(URL)
>>> '_KBh' in res.text
False

After adding a custom User-Agent:

>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
>>> res = requests.get(URL, headers=headers)
>>> '_KBh' in res.text
True

Adding the headers to your code gives the following output (the first 3 links you were looking for):

https://www.express.co.uk/sport/tennis/918251/Roger-Federer-Felix-Auger-Aliassime-practice
https://sports.yahoo.com/breaks-lighter-schedules-help-players-improve-says-federer-092343458--ten.html
http://www.news18.com/news/sports/rafael-nadal-stays-atop-atp-rankings-roger-federer-aims-to-overtake-1658665.html
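
For reference, the only change needed in the original script is to pass the same headers to requests.get (everything else stays the same):

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
res = requests.get("https://www.google.com/search?q=" + term, headers=headers)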
Keyur Potdar
  • Sorry for the late post. That worked fine, but the same header didn't work for my browser (the windows opened but showed a 404 "not found on server" error). So I looked into how to make custom headers, but without success. Please clarify how to make one or provide a link for it :) Also, how were you able to conclude that adding a custom header would provide the solution (i.e. that Google was blocking the Python script)? Thank you. – Abhishek Negi Feb 15 '18 at 15:35
  • You got a `404` because you're appending the url to `'https://google.com/'` when the url is already complete (i.e. `https://www.express.co.uk/...`). Use `webbrowser.open(linkElems[i].get('href'))`. It will work. – Keyur Potdar Feb 15 '18 at 15:41
  • There was no problem with the header, only the `webbrowser.open(...)` part. As for your second question, the first sentence in my post is the answer. As you move forward with web scraping, you'll encounter such problems (a blocked Python script) many times on different websites. The same thing has happened to me many times; that's why I could conclude that adding a custom header would provide a solution. – Keyur Potdar Feb 15 '18 at 15:43
  • Was thinking the same thing. But how did you provide the specifications for the custom header? Please specify :) – Abhishek Negi Feb 15 '18 at 15:49
  • I did not create that user agent on my own. It's the browser's user agent ([you can find it here](https://www.whatismybrowser.com/detect/what-is-my-user-agent)). Have a look at [this question](https://stackoverflow.com/questions/27652543/how-to-use-python-requests-to-fake-a-browser-visit). I think it will answer your questions. – Keyur Potdar Feb 15 '18 at 15:50
  • Thanks for the help. – Abhishek Negi Feb 15 '18 at 15:52

As the other answer mentioned, it's because no user-agent was specified. The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit. Check what your user-agent is.
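
For example, a quick way to see what Google receives by default is requests' own helper in the requests.utils module:

import requests

# Prints something like 'python-requests/2.x.x' - this is what Google sees
# when no custom headers are passed
print(requests.utils.default_user_agent())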

If you want to scrape a lot, another good idea is to rotate the user-agent, or to add header ordering, to lower the chance of the request being blocked.
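
A minimal rotation sketch, assuming you maintain your own pool of User-Agent strings (the two values below are just examples):

import random
import requests

# Example pool; in practice keep a larger, regularly updated list
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
]

# Pick a different user-agent per request so traffic looks less like a single bot
headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.google.com/search", headers=headers, params={"q": "federer"})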


Code and example:

from bs4 import BeautifulSoup
import requests, lxml  # lxml only needs to be installed; bs4 uses it as the parser backend

# Pretend to be a real browser so Google returns the full results page
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",  # search query
    "gl": "in",                                # country to search from
    "hl": "en"                                 # interface language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# Each organic result lives in a .tF2Cxc container
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(f'{title}\n{link}\n')

-------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw

Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647

Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481

...
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that there's no need to figure out why certain things don't work as they should and then maintain the parser over time; instead, the only thing that needs to be done is to iterate over the structured JSON and quickly get the data you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "in",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# The organic results come back as structured JSON with 'title' and 'link' fields
for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
    print()

------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw

Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
...
'''

Disclaimer, I work for SerpApi.

Dmitriy Zub