I am trying to scrape a given number of results from a Google search, but so far I have come across two problems. The first is that I don't know how to pair the URLs and the titles inside the same loop so that they are shown together in this format:
(Title)
(Website URL)
(---------)
(Title)
(Website URL)
(---------)
I somehow managed to achieve this format, but the output repeats many times over instead of just showing the top 10 results. I believe it has something to do with how I nested the two loops, but I don't know how to avoid this.
The other problem is that I want both the main URL and the title of each website in the search results. While I get the right titles, I seem to get many links from the same website instead of only the main URL. For instance, if I search for "data science", the second or third title shown is from Coursera, while the link printed with it is from Wikipedia. I only want the main URL, so that each title matches its website URL. How do I get it?
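To make "main URL" concrete, here is a minimal sketch of the href cleanup, using only the standard library. It assumes the result links have the redirect form `/url?q=<target>&sa=U&...` (the shape the code below already splits on) and that "main URL" means the scheme plus host of the target. The helper name `clean_google_href` and the sample href are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def clean_google_href(href):
    """Return (target_url, site_root) for a Google redirect href, else None.

    Assumes redirect links look like "/url?q=<target>&sa=U&...".
    """
    parts = urlsplit(href)
    if parts.path != "/url":
        return None  # not a redirect link (e.g. navigation or pagination)
    targets = parse_qs(parts.query).get("q")
    if not targets:
        return None
    target = targets[0]  # parse_qs also percent-decodes the value
    site = urlsplit(target)
    if not site.scheme or not site.netloc:
        return None  # "q" did not hold an absolute URL
    return target, f"{site.scheme}://{site.netloc}"


# Hypothetical href, shaped like the ones in the search response
print(clean_google_href(
    "/url?q=https://www.coursera.org/browse/data-science&sa=U&ved=abc"
))
```

Parsing the query string this way avoids depending on `&sa=U` always following the `q` parameter, which the chained `split` calls rely on.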
Any input will be greatly appreciated.
import requests
from bs4 import BeautifulSoup
import re
query = "data science"
search = query.replace(' ', '+')
results = 10
url = (f"https://www.google.com/search?q={search}&num={results}")
requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
soup_title = BeautifulSoup(requests_results.text,"html.parser")
links = soup_link.find_all("a")
heading_object=soup_title.find_all( 'h3' )
for link in links:
    for info in heading_object:
        get_title = info.getText()
        link_href = link.get('href')
        if "url?q=" in link_href and not "webcache" in link_href:
            print(get_title)
            print(link.get('href').split("?q=")[1].split("&sa=U")[0])
            print("------")
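The nested loops above print every title once for every matching link, which would explain the repeated output. A possible single-loop alternative is sketched below. It rests on an assumption about Google's markup that may not hold or may change: that each organic result's `<h3>` title sits inside the `<a>` tag carrying its `/url?q=...` link, so sub-links without an `<h3>` can be skipped. The function name `pair_results` and the `sample_html` snippet are hypothetical, and the sample stands in for `requests_results.text` so the sketch runs without a network call.

```python
from urllib.parse import parse_qs, urlsplit

from bs4 import BeautifulSoup


def pair_results(html, limit=10):
    """Return up to `limit` (title, url) pairs from one pass over the links."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for link in soup.find_all("a", href=True):
        heading = link.find("h3")  # assumption: the title is nested in the link
        if heading is None:
            continue  # sub-link or navigation link, not a result title
        parts = urlsplit(link["href"])
        targets = parse_qs(parts.query).get("q")
        if parts.path != "/url" or not targets:
            continue
        pairs.append((heading.get_text(strip=True), targets[0]))
        if len(pairs) == limit:
            break
    return pairs


# Hypothetical markup imitating the assumed structure of a results page
sample_html = """
<a href="/url?q=https://en.wikipedia.org/wiki/Data_science&amp;sa=U"><h3>Data science - Wikipedia</h3></a>
<a href="/url?q=https://en.wikipedia.org/wiki/Statistics&amp;sa=U">Statistics</a>
<a href="/url?q=https://www.coursera.org/browse/data-science&amp;sa=U"><h3>Data Science Courses</h3></a>
<a href="/search?q=data+science&amp;start=10">Next</a>
"""

for title, url in pair_results(sample_html):
    print(title)
    print(url)
    print("------")
```

Because the title and the link come from the same `<a>` element, they cannot drift out of step the way two independently collected lists can.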