
I'm trying to parse the g-inner-card elements (class="_KBh"), but for some reason the selector returns nothing:

linkElems = soup.select('._KBh a')

print(linkElems)

This prints an empty list, [].

import webbrowser, sys, pyperclip, requests, bs4

# Search term comes from the command line, or from the clipboard as a fallback
if len(sys.argv) > 1:
    term = ' '.join(sys.argv[1:])
else:
    term = pyperclip.paste()

res = requests.get("https://www.google.com/search?q=" + term)
try:
    res.raise_for_status()
except Exception as ex:
    print('There was a problem: %s' % ex, '\nSorry!!')

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('._KBh a')
print(linkElems)

# Open at most 3 of the matched result links in the browser
numOpen = min(3, len(linkElems))
for i in range(numOpen):
    print(linkElems[i].get('href'))
    webbrowser.open('https://google.com/' + linkElems[i].get('href'))

This code snippet tries to open at most 3 Google search results in separate browser windows when a command-line argument (i.e. the term to be searched) is entered. It specifically targets the results in Google's inner cards.

2 Answers


If you print res.text, you can see that you are not getting the complete/correct data from the page. This is happening because Google is blocking the Python script.

To overcome this, you can pass a User-Agent header to make the script look like a real browser.

Results with default User-Agent:

>>> URL = 'https://www.google.co.in/search?q=federer'
>>> res = requests.get(URL)
>>> '_KBh' in res.text
False

After adding a custom User-Agent:

>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
>>> res = requests.get(URL, headers=headers)
>>> '_KBh' in res.text
True

Adding the headers to your code gives the following output (the first 3 links you were looking for):

https://www.express.co.uk/sport/tennis/918251/Roger-Federer-Felix-Auger-Aliassime-practice
https://sports.yahoo.com/breaks-lighter-schedules-help-players-improve-says-federer-092343458--ten.html
http://www.news18.com/news/sports/rafael-nadal-stays-atop-atp-rankings-roger-federer-aims-to-overtake-1658665.html
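
For reference, the only change needed in the original script is to pass the same headers to requests.get (everything else stays the same):

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
res = requests.get("https://www.google.com/search?q=" + term, headers=headers)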
Keyur Potdar
  • Sorry for the late post. That worked fine, but the same header didn't work for my browser (the windows opened but showed a 404 "not found on server" error). So I looked into how to make custom headers, but without success. Please clarify how to make one or provide a link for it :) Also, how were you able to conclude that adding a custom header would provide the solution (i.e. that Google was blocking the Python script)? Thank you. – Abhishek Negi Feb 15 '18 at 15:35
  • You got a `404` because you're appending the url to `'https://google.com/'` when the url is already complete (i.e. `https://www.express.co.uk/...`). Use `webbrowser.open(linkElems[i].get('href'))`. It will work. – Keyur Potdar Feb 15 '18 at 15:41
  • There was no problem with the header, only the `webbrowser.open(...)` part. As for your second question, the first sentence in my post is the answer. As you move forward with web scraping, you'll encounter such problems (a blocked Python script) many times on different websites. The same thing has happened to me many times; that's why I could conclude that adding a custom header would provide a solution. – Keyur Potdar Feb 15 '18 at 15:43
  • Was thinking the same thing. But how did you provide the specifications for the custom header? Please specify :) – Abhishek Negi Feb 15 '18 at 15:49
  • I did not create that user agent on my own. It's the browser's user agent ([you can find it here](https://www.whatismybrowser.com/detect/what-is-my-user-agent)). Have a look at [this question](https://stackoverflow.com/questions/27652543/how-to-use-python-requests-to-fake-a-browser-visit). I think it will answer your questions. – Keyur Potdar Feb 15 '18 at 15:50
  • Thanks for the help. – Abhishek Negi Feb 15 '18 at 15:52

As the other answer mentioned, it's because no user-agent was specified. The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit. Check what your user-agent is.
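
For example, a quick way to see what Google receives by default is requests' own helper in the requests.utils module:

import requests

# Prints something like 'python-requests/2.x.x' - this is what Google sees
# when no custom headers are passed
print(requests.utils.default_user_agent())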

If you want to scrape a lot, another good idea is to rotate the user-agent, or to add header ordering, to lower the chance of the request being blocked.
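
A minimal rotation sketch, assuming you maintain your own pool of User-Agent strings (the two values below are just examples):

import random
import requests

# Example pool; in practice keep a larger, regularly updated list
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
]

# Pick a different user-agent per request so traffic looks less like a single bot
headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.google.com/search", headers=headers, params={"q": "federer"})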


Code and example:

from bs4 import BeautifulSoup
import requests, lxml  # lxml only needs to be installed; bs4 uses it as the parser backend

# Pretend to be a real browser so Google returns the full results page
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",  # search query
    "gl": "in",                                # country to search from
    "hl": "en"                                 # interface language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# Each organic result lives in a .tF2Cxc container
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(f'{title}\n{link}\n')

-------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw

Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647

Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481

...
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that there's no need to figure out why certain things don't work as they should and then maintain the parser over time; instead, the only thing that needs to be done is to iterate over the structured JSON and quickly get the data you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "in",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# The organic results come back as structured JSON with 'title' and 'link' fields
for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
    print()

------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw

Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
...
'''

Disclaimer, I work for SerpApi.

Dmitriy Zub