
I have working code that first prints the search titles and then the URLs, but it prints a lot of URLs between the website titles. How can I print them in a format like the one below and avoid printing the same URLs 10 times for each title:

1) Title url
2) Title url
and so on... 

My code:

import requests
from bs4 import BeautifulSoup

search = input("Search:")

page = requests.get(f"https://www.google.com/search?q=" + search)

soup = BeautifulSoup(page.content, "html5lib")

links = soup.findAll("a")

heading_object = soup.find_all('h3')

for info in heading_object:
    x = info.getText()
    print(x)
    for link in links:
        link_href = link.get('href')
        if "url?q=" in link_href:
            y = (link.get('href').split("?q=")[1].split("&sa=U")[0])
            print(y)
  • I wouldn't expect this to work with just BeautifulSoup and a simple GET request, as Google renders a lot of the page with JS. You might want to look at the [Search API](https://developers.google.com/custom-search/v1/overview) (a sketch of it is shown after these comments). – Yevhen Kuzmovych Mar 04 '21 at 14:30
  • Code with `for info in heading_object: x = info.getText() for link in links: link_href = link.get('href') if "url?q=" in link_href: y = (link.get('href').split("?q=")[1].split("&sa=U")[0]) print(x, y)` gives me the needed result, except that it prints around 10 copies of each result from Google. – Silka Mar 04 '21 at 14:46
  • Always put code, data and the full error message as text (not a screenshot, not a link) in the question (not in a comment). – furas Mar 04 '21 at 15:00
  • You should search in a different way - first find the object which keeps both `title` and `url`, and then search for the single `title` and `url` inside that object to get them as a pair. Alternatively you could use `zip(heading_object, links)` to create pairs, but it may give a wrong result if some item (title or link) was empty on the page, because then it shifts the other items into its place. – furas Mar 04 '21 at 15:03
  • I edited the code. The problem is that it prints out ALL the URLs that it finds on the Google Search page. – Silka Mar 04 '21 at 15:09
  • Does this answer your question? [Scrape google search results titles and urls using Python](https://stackoverflow.com/questions/56392962/scrape-google-search-results-titles-and-urls-using-python) – ilyazub Aug 26 '21 at 13:49
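Regarding the Custom Search API mentioned in the first comment, here is a minimal sketch, assuming you have created a Programmable Search Engine and obtained an API key and an engine ID (both placeholder values below are hypothetical):

import requests

API_KEY = "YOUR_API_KEY"          # hypothetical placeholder
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # hypothetical placeholder

params = {
    "key": API_KEY,
    "cx": SEARCH_ENGINE_ID,
    "q": "python memes",
}

response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
data = response.json()

# each item in the JSON response already carries a matching title and link,
# so no manual pairing of headings and anchors is needed
for i, item in enumerate(data.get("items", []), start=1):
    print(f"{i}) {item['title']} {item['link']}")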

2 Answers


If you get the titles and links separately, then you can use zip() to group them in pairs:

for info, link in zip(heading_object, links):
    info = info.getText()

    link = link.get('href')
    if "?q=" in link:
        # strip Google's redirect wrapper to get the real target URL
        link = link.split("?q=")[1].split("&sa=U")[0]

    print(info, link)

But this may cause a problem when some title or link doesn't exist on the page, because then it will create wrong pairs: it will pair a title with the link of the next element. You should rather search for the elements which keep both the title and the link, and inside every such element search for the single title and single link to create a pair. If there is no title or link, then you can put in some default value and it will not create wrong pairs.
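A minimal sketch of that approach, assuming the `html5lib` parser from the question and that every Google result sits in a container `div` with class `g` (the class name is an assumption and Google changes its markup regularly):

import requests
from bs4 import BeautifulSoup

search = input("Search:")

page = requests.get("https://www.google.com/search?q=" + search)
soup = BeautifulSoup(page.content, "html5lib")

# every result container holds both the title and the link,
# so searching inside it keeps them paired correctly
for result in soup.find_all("div", class_="g"):   # class name is an assumption
    heading = result.find("h3")
    anchor = result.find("a", href=True)

    # use a default value when a part is missing instead of shifting the pairs
    title = heading.get_text() if heading else "<no title>"
    url = anchor["href"] if anchor else "<no url>"
    if "url?q=" in url:
        url = url.split("?q=")[1].split("&sa=U")[0]

    print(title, url)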

furas
  • Thank you, it now prints the title and link together on one line. The only issue is that it doesn't print the full URL, e.g. "/?sa=X&ved=0ahUKEwirhbTIh5fvAhVmyDgGHYGlCscQOwgC" or "/?output=search&ie=UTF-8&sa=X&ved=0ahUKEwirhbTIh5fvAhVmyDgGHYGlCscQPAgE". I'm not sure how to fix it. – Silka Mar 04 '21 at 16:31
  • I don't get it - if you need the full URL then remove the `if`, and you will see what you really get from the server. It may send different HTML for different devices (especially if they use the wrong `user-agent` header and don't use JavaScript). – furas Mar 04 '21 at 16:51

You're looking for this:

for result in soup.select('.yuRUbf'):
  title = result.select_one('.DKV0Md').text
  url = result.a['href']
  print(f'{title}, {url}\n') # prints TITLE, URL followed by a new line.

If you're using an f-string, then the proper way to use it is like so:

page = requests.get(f"https://www.google.com/search?q=" + search) # not proper f-string
page = requests.get(f"https://www.google.com/search?q={search}")  # proper f-string

Code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
  'User-agent':
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "python memes",
  "hl": "en"
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')

for result in soup.select('.yuRUbf'):
  title = result.select_one('.DKV0Md').text
  url = result.a['href']
  print(f'{title}, {url}\n')

--------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/

ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en

28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''

Alternatively, you can do the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

One of the differences is that you only need to iterate over JSON rather than figuring out how to scrape stuff.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "python memes",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  url = result['link']
  print(f'{title}, {url}\n')

-------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/

ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en

28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''

Disclaimer, I work for SerpApi.

Dmitriy Zub