
I'm trying to extract the first link from a search results page using Beautiful Soup, but it can't find the link for some reason.

from requests import get
from bs4 import BeautifulSoup
import requests

band = "it's my life bon jovi"
url = f'https://www.letras.mus.br/?q={band}'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')


linkurl = soup.find_all("div", class_="wrapper")
for urls in linkurl:
    
    print(urls.get('href'))
    #print(soup.a['href']) -- returns /
    #print(soup.a['data-ctorig']) -- returns nothing

I would like to get the link from the data-ctorig attribute or the href. Does this page have a script that is preventing me from reaching this information, or is it a problem with my code?


Lucas Tesch

1 Answer


The website uses Google Programmable Search Engine (CSE) to return cached results. This requires JavaScript to run in a browser, which doesn't happen with requests.
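As a quick sanity check, here is a minimal sketch (the .gsc-thumbnail-inside .gs-title selector is assumed from the rendered CSE markup) confirming that the result elements simply aren't in the HTML that requests returns:

import requests
from bs4 import BeautifulSoup

band = "it's my life bon jovi"
url = f'https://www.letras.mus.br/?q={band}'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

# The result anchors are injected client-side by the CSE script,
# so this prints an empty list
print(soup.select('.gsc-thumbnail-inside .gs-title'))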

It is far easier to use selenium with a more targeted CSS selector to retrieve the results.

While the wait doesn't seem to be needed in this case, I have added it for good measure.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

band = "it's my life bon jovi"
url = f'https://www.letras.mus.br/?q={band}'

d = webdriver.Chrome()
d.get(url)

# Wait for the CSE result titles (anchors carrying a target attribute) to be present
links = WebDriverWait(d, 10).until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, ".gsc-thumbnail-inside .gs-title[target]")))
links = [link.get_attribute('href') for link in links]
print(links[0])
QHarr
  • Is there any way to get this information in the background without using selenium? My application only allows backend code. – Lucas Tesch Jun 08 '22 at 12:22
  • Not easily, that I can see. There are several steps to getting this data by XHR. – QHarr Jun 10 '22 at 00:13
  • @LucasTesch what about [`chrome_options.add_argument("--headless")`](https://stackoverflow.com/a/53657649/15164646) to run `selenium` in headless mode? Or you can render the page in headless mode and then pass [`page_source`](https://stackoverflow.com/a/16114362/15164646) to `bs4` as you would with `requests` (see the sketch below). Either way, you can run the script in docker or whatever. – Dmitriy Zub Aug 24 '22 at 06:03
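To flesh out that last comment, here is a minimal sketch combining headless Chrome with page_source and bs4. The --headless option and the reuse of the answer's CSS selector are assumptions on my part, not something stated in the answer itself:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

band = "it's my life bon jovi"
url = f'https://www.letras.mus.br/?q={band}'

# Run Chrome without a visible window so the script can run server-side
options = webdriver.ChromeOptions()
options.add_argument("--headless")
d = webdriver.Chrome(options=options)
d.get(url)

# Wait for the CSE results to be rendered before grabbing the page source
WebDriverWait(d, 10).until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, ".gsc-thumbnail-inside .gs-title[target]")))

# Hand the rendered HTML to bs4, as suggested in the comment above
soup = BeautifulSoup(d.page_source, 'html.parser')
links = [a.get('href') for a in soup.select('.gsc-thumbnail-inside .gs-title[target]')]
print(links[0])

d.quit()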