1

My goal is to write a web-scraping program in Python that parses a Google search results page using Beautiful Soup and opens several result links at a time. The program looks like this:

#! python3
# searchGoogle.py - Opens several google results.

import requests, sys, webbrowser, bs4
print('Searching...') # display text while downloading the result page
res = requests.get('https://www.google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# Open a browser tab for each result.
linkElems = soup.select('div.yuRUbf > a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    urlToOpen = linkElems[i].get('href')
    print('Opening', urlToOpen)
    webbrowser.open(urlToOpen)

Since my HTML skills are limited, I don't know exactly how to retrieve the HTML elements that contain the links.

Here is the web page I want to parse: https://www.google.com/search?q=boring+stuff

My browser's developer console shows the following HTML code:

[Screenshot of the browser's developer console, showing a search result link inside a <div class="yuRUbf"> element]

All links are in elements with class="yuRUbf" (I have marked one example in the attached picture.)

My question: what is the correct argument that I have to pass to the soup.select() method? Because all 'a' elements are directly within 'div' elements whose class attribute is 'yuRUbf', I thought 'div.yuRUbf > a' was correct... but the program does not work: no web pages are opened in the browser.

Can an experienced HTML developer help me with this problem? Is the argument that I pass to the soup.select() method incorrect? What should it be? Or is the problem somewhere else?

I am using macOS Catalina and Python 3.8.

aurumpurum
  • Please either paste the link to that web page, or paste the actual code, not an image. If anybody wants to try out your code to help, we need the actual HTML to test it on. – William Jun 06 '21 at 17:38
  • The DOM ≠ the page source. Since it looks like you’re using chrome, a more accurate representation of the page you’ll get using an HTTP client like `requests` would be in the actual *View Page Source* view (Windows: CTRL-U, macOS: Command-U). It’s likely none of the elements you’re targeting are present in the actual page source and instead are dynamically generated using JavaScript, which neither `requests` nor `BeautifulSoup` has the ability to interpret/execute. [A quick way to check this is sketched just after these comments.] – esqew Jun 06 '21 at 17:46
  • @William Ok, sorry, I posted the link to the search results page. – aurumpurum Jun 06 '21 at 18:03
  • @aurumpurum Are you trying to learn bs4? Or parse Google? If you want to search Google, there's a great package for that: https://pypi.org/project/googlesearch-python/ – William Jun 06 '21 at 18:07
  • @William trying to learn bs4 for webscraping :-) I know there are other packages like selenium but I want to go step by step. Thanks for your googlesearch_python package! Looks interesting, will check it out! Cheers. – aurumpurum Jun 06 '21 at 18:59
  • @esqew I am using Brave browser. View page source on macOS is opt + cmd + U. To open the developer console I use right click and then "inspect". Or in the menu view --> developer --> inspect elements. I am looking for the elements in the developer console (not in the page source view). But how should I interpret your answer? Could you explain it to me? – aurumpurum Jun 07 '21 at 01:39
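
To see esqew's point in practice, one quick check is to download the page with requests and look for the class that the developer console shows. A minimal sketch, assuming the same "boring stuff" query as in the question:

import requests

res = requests.get('https://www.google.com/search?q=boring+stuff')
res.raise_for_status()

# The class seen in the browser's developer console may not be present in the
# HTML that requests receives: Google serves different markup to different
# clients, and the browser may rewrite the DOM after the page loads.
print('"yuRUbf" in raw HTML:', 'yuRUbf' in res.text)
print('"<h3>" in raw HTML: ', '<h3' in res.text)

If the first check prints False, 'div.yuRUbf > a' cannot match anything in the downloaded page, no matter how well it describes the DOM shown in the developer console.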

2 Answers

4

To obtain correct results from the Google server, set the User-Agent HTTP header. You can then use the CSS selector a:has(h3) to get your links:

import requests
from bs4 import BeautifulSoup


url = "https://www.google.com/search"
params = {"q": "boring stuff"}  # add "hl":"en" to get english results
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
soup = BeautifulSoup(
    requests.get(url, params=params, headers=headers).content, "html.parser"
)

for a in soup.select("a:has(h3)"):
    print(a["href"])

Prints:

https://automatetheboringstuff.com/
https://www.martinus.sk/?uItem=231151
https://www.amazon.com/Automate-Boring-Stuff-Python-2nd/dp/1593279922
https://www.amazon.com/Automate-Boring-Stuff-Python-Programming/dp/1593275994
https://knihy.heureka.sk/automate-the-boring-stuff-with-python-sweigart-albert/
https://www.udemy.com/course/automate/
https://inventwithpython.com/blog/2019/10/07/whats-new-in-the-2nd-edition-of-automate-the-boring-stuff-with-python/
https://towardsdatascience.com/how-to-use-bash-to-automate-the-boring-stuff-for-data-science-d447cd23fffe
https://www.barnesandnoble.com/w/automate-the-boring-stuff-with-python-2nd-edition-al-sweigart/1133598925
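
For reference, here is a sketch of how the searchGoogle.py script from the question could be adapted to this selector, keeping the question's limit of five tabs (the User-Agent string is only an example):

#! python3
# searchGoogle.py (sketch) - Opens several Google results via the a:has(h3) selector.

import sys, webbrowser
import requests
from bs4 import BeautifulSoup

print('Searching...')  # display text while downloading the result page
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}
res = requests.get("https://www.google.com/search",
                   params={"q": " ".join(sys.argv[1:])}, headers=headers)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")

# Result links are the <a> elements that contain an <h3> heading.
linkElems = soup.select("a:has(h3)")
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    urlToOpen = linkElems[i].get("href")
    print("Opening", urlToOpen)
    webbrowser.open(urlToOpen)
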
Andrej Kesely
  • Thanks! Looks great, I will check it in depth...because I am learning how to use CSS selectors: could you comment on my attempt to pass 'div.yuRUbf > a' to the soup.select() method? Is my idea totally wrong? Or can I adapt it? Just out of curiosity, before I move on and check your solution. Thank you! – aurumpurum Jun 06 '21 at 19:05
  • @aurumpurum From experience, when I see class names such as `"yuRUbf"` I know they change often (generally). So I try to find other "patterns". Like on this site: I see the link texts are inside `<h3>` tags, so I use this information to construct the CSS selector. – Andrej Kesely Jun 06 '21 at 19:07
  • Ok, that makes sense, thanks. But where do you see the `<h3>` tags? In my developer console, I haven't seen any (see screenshot). Again, this is just so I can learn something from this case... – aurumpurum Jun 06 '21 at 19:28
  • @aurumpurum I usually do `print(soup.prettify())` to observe what HTML the server returned. – Andrej Kesely Jun 06 '21 at 20:47
  • Hey Andrej, I am still working on this case: `soup.select("a:has(h3)")` returns me an empty list. What could be wrong? Is there a way I could show you my code? I would like to learn more about why I have to pass the headers argument and change the User-Agent... – aurumpurum Jun 12 '21 at 07:25
  • @aurumpurum Make sure you don't get a captcha page and that the relevant tags are still there. [A way to diagnose this is sketched just after these comments.] – Andrej Kesely Jun 12 '21 at 07:25
  • I am not sure what you mean...does the get() method give me a captcha page? How can I tell that? `res = requests.get('https://www.google.com/search?q=boring+stuff')`then `soup = bs4.BeautifulSoup(res.text, 'html.parser')`then `linkElems = soup.select('a:has(h3)')` and then linkElems is empty... – aurumpurum Jun 12 '21 at 07:35
  • Ok, I have figured out that it has to do with that User-Agent "thing" (the headers argument)...I just couldn't find out what it really does... https://docs.python-requests.org/en/master/user/quickstart/#custom-headers Could you explain to me why we need this `headers` argument? – aurumpurum Jun 12 '21 at 07:56
  • @aurumpurum Which HTTP headers you need depends on the server. For example, Google requires `User-Agent`, but other servers require different ones. It doesn't depend on the `requests` module. – Andrej Kesely Jun 12 '21 at 08:44
  • Thanks, nice to know! Seems to be an important point for someone who wants to do webscraping! Can we generalize that? How can I check which User Agent I have to use? Where can I read more about this topic? – aurumpurum Jun 12 '21 at 09:13
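
A rough way to diagnose the empty-list problem discussed in these comments is to fetch the page with the default requests User-Agent and with a browser-like one, then compare what comes back. A sketch (the captcha check is only a heuristic, since Google's block and consent pages vary):

import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search"
params = {"q": "boring stuff"}
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
}

for label, headers in [("default requests User-Agent", None),
                       ("browser User-Agent", browser_headers)]:
    res = requests.get(url, params=params, headers=headers)
    soup = BeautifulSoup(res.text, "html.parser")
    links = soup.select("a:has(h3)")
    # Rough heuristic: block/captcha pages tend to mention these phrases.
    blocked = "unusual traffic" in res.text or "captcha" in res.text.lower()
    print(label, "->", len(links), "links,",
          "looks like a block page" if blocked else "no obvious block page")
    # print(soup.prettify())  # uncomment to inspect the returned HTML, as suggested above
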
1

Reading the question again, if you want to open Google search links, you should use the existing tools: Searching in Google with Python

In particular, the awesome google search package: https://pypi.org/project/googlesearch-python/

Best not to reinvent the wheel unless the existing package can't do what you want (or if you're trying to learn bs4).
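
For the "open the top results" part of the question, here is a minimal sketch of how that package could be used, assuming the search() helper shown on the package's PyPI page (check its documentation for the exact parameter names):

# pip install googlesearch-python
# Sketch only: assumes googlesearch-python exposes search(), which yields result URLs.
import webbrowser
from googlesearch import search

for url in search("boring stuff", num_results=5):
    print("Opening", url)
    webbrowser.open(url)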

Edit: Re-re-reading the question, you asked specifically about beautifulsoup. My bad.

William