
I'm using Python 3. The code below is supposed to let the user enter a search term into the command line, after which it searches Google and runs through the HTML of the results page to find tags matching the CSS selector ('.r a').

Say we search for the term "cats." I know the tags I'm looking for exist on the "cats" search results page since I looked through the page source myself.

But when I run my code, the linkElems list is empty. What is going wrong?

    import requests, sys, bs4

    print('Googling...')
    res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
    print(res.raise_for_status())

    soup = bs4.BeautifulSoup(res.text, 'html5lib')
    linkElems = soup.select(".r a")
    print(linkElems)
Asker
  • Someone else had the same problem as me on the forum below. Someone said it could have something to do with JavaScript, but I don't understand the solution posted. https://python-forum.io/Thread-I-m-Feeling-Lucky-script-problem-again – Asker Nov 20 '19 at 08:26

2 Answers


The ".r" class is rendered by JavaScript, so it's not present in the HTML that requests receives. You can either render the JavaScript using Selenium (or a similar tool), or try a more creative way of extracting the links from the tags. First check that the tags exist at all by finding them without the ".r" class: soup.find_all("a"). Then, as an example, you can use a regex to extract all URLs beginning with "/url?q=":

import re
linkelems = soup.find_all(href=re.compile(r"^/url\?q="))
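
Once you have those hrefs, the actual destination URL can be recovered with the standard library alone. A small sketch (the href below is illustrative of the "/url?q=" shape, not a real result):

```python
from urllib.parse import urlparse, parse_qs

# Illustrative href in the "/url?q=" shape Google uses for result links
href = "/url?q=https://www.example.com/cats&sa=U&ved=abc"

# The real destination sits in the "q" query parameter
target = parse_qs(urlparse(href).query)["q"][0]
print(target)  # https://www.example.com/cats
```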
Matts
  • Thanks for this answer, I'll give Selenium a try and report back. Also, could you point me to a resource for learning which classes are rendered by JavaScript and which are plain HTML? (I'm trying to get a better mental picture of the relationship between JS and the limitations of the requests module in Python. In particular, if Requests is unable to get JS-rendered classes, I wonder what the other limitations of the Requests module are.) – Asker Nov 21 '19 at 01:12
    JS executes in the browser, and since Selenium drives a real browser, it's able to render it. There's no method or resource that I'm aware of to determine which classes are rendered, other than by checking the response. Google probably has some advanced methods to prevent scraping of content. – Matts Nov 22 '19 at 08:56

Contrary to what Matts mentioned, the parts you want to extract are not rendered by JavaScript, and you don't need a regex for this task.

Make sure you're passing a user-agent header, otherwise Google will eventually block your request. That's likely why you got an empty output: without one, you receive completely different HTML. Check what your user-agent is. I've already answered elsewhere what a user-agent and HTTP headers are.
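
To see why Google treats the request differently, you can print the user-agent that requests sends by default (a quick check; the exact version string depends on your installed requests):

```python
import requests

# By default requests identifies itself as python-requests/<version>,
# which Google can easily flag as a script rather than a browser.
print(requests.utils.default_user_agent())
```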

Pass user-agent into HTTP headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)

html5lib is the slowest parser; try lxml instead, it's much faster. If you want an even faster parser, have a look at selectolax.
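
Switching parsers is a one-argument change (assuming lxml is installed, e.g. via pip install lxml); your selectors and the rest of the code stay the same:

```python
from bs4 import BeautifulSoup

html = "<div class='r'><a href='https://example.com'>cats</a></div>"

# Same markup, different parser backends; only the second argument changes.
soup_lxml = BeautifulSoup(html, "lxml")            # fast C-based parser, needs lxml installed
soup_builtin = BeautifulSoup(html, "html.parser")  # stdlib fallback, slower but no extra install

print(soup_lxml.select_one(".r a")["href"])  # https://example.com
```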


Code and full example:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "selena gomez"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''

Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with the parsing part, instead, you only need to iterate over structured JSON and get the data you want, plus you don't have to maintain the parser over time.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "selena gomez",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  link = result['link']
  print(link)

----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''

P.S - I wrote a blog post about how to scrape Google Organic Search Results.

Disclaimer, I work for SerpApi.

Dmitriy Zub