
I am trying to extract song data from this URL https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF and here is my code trials:

import requests
from bs4 import BeautifulSoup

URL = "https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF"
page = requests.get(URL)
page_source = BeautifulSoup(page.text, "lxml")
song_tables = page_source.find_all("div", {"data-testid": "tracklist-row"})

data = []
for song_table in song_tables:
    s = song_table.find_all("span", attrs={"data-encore-id": "type"})
    if s:
        data.append([span.text for span in s])
print(data)

I am getting an empty list. I am new to web scraping.

Santhiya s
  • The website is most likely dynamically loaded, that is, most of the HTML is likely loaded in by JavaScript. Thus, you would need to execute the JS in order to load the majority of the HTML (which likely contains the element you want). The Selenium module can be used to send the GET request and execute the JS. – Übermensch Mar 09 '23 at 03:00
  • Also, Spotify offers an API and the python community has built the `spotipy` module that makes it really easy to use. – Übermensch Mar 09 '23 at 03:04
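As the comment above suggests, the official Web API via `spotipy` sidesteps the scraping problem entirely. A rough sketch of that route (an assumption-laden example, not part of the answers here: it presumes you have registered a Spotify app and exported `SPOTIPY_CLIENT_ID` / `SPOTIPY_CLIENT_SECRET`, which spotipy's `SpotifyClientCredentials` reads by default):

```python
def track_names(playlist_response):
    """Extract song names from a spotipy playlist-items response dict."""
    return [item["track"]["name"]
            for item in playlist_response.get("items", [])
            if item.get("track")]


def fetch_playlist_names(playlist_id):
    """Fetch one page of playlist tracks through the Spotify Web API."""
    import spotipy                                        # third-party:
    from spotipy.oauth2 import SpotifyClientCredentials   # pip install spotipy

    # Credentials come from the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET
    # environment variables by default.
    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
    return track_names(sp.playlist_items(playlist_id))
```

Usage would be `fetch_playlist_names("37i9dQZEVXbNG2KDcFcKOF")`; `playlist_items` returns at most 100 tracks per call, so longer playlists need paging with its `offset` parameter.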

2 Answers


Using Selenium you can easily extract the names of the songs by inducing WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR and text attribute:

    driver.get("https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div#onetrust-close-btn-container > button[aria-label='Close']"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[data-testid=tracklist-row] div[dir=auto][data-encore-id=type]")))])
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get("https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div#onetrust-close-btn-container > button[aria-label='Close']"))).click()
    print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']//div[@dir='auto' and @data-encore-id='type']")))])
    
  • Console Output:

    ['Flowers', 'TQG', 'Die For You - Remix', 'Kill Bill', "Boy's a liar Pt. 2", 'Shakira: Bzrp Music Sessions, Vol. 53', "Creepin' (with The Weeknd & 21 Savage)", 'As It Was', 'Yandel 150', 'Unholy (feat. Kim Petras)', 'La Bachata', 'Calm Down (with Selena Gomez)', 'X SI VOLVEMOS', 'Die For You', "I'm Good (Blue)", 'Escapism.', 'Hey Mor', 'Anti-Hero', 'Here With Me', 'OMG', 'Until I Found You (with Em Beihold) - Em Beihold Version', 'La Jumpa', 'golden hour']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
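For completeness, the snippets above can be combined into one runnable script, roughly as follows (a sketch only: it assumes chromedriver is on your PATH, and the cookie-banner selector may have changed since the answer was written):

```python
def nonempty_names(texts):
    """Trim whitespace and drop blank entries from scraped name strings."""
    return [t.strip() for t in texts if t and t.strip()]


def visible_track_names(driver, timeout=20):
    """Wait for the visible tracklist rows and return their song names."""
    from selenium.webdriver.common.by import By              # third-party:
    from selenium.webdriver.support.ui import WebDriverWait  # pip install selenium
    from selenium.webdriver.support import expected_conditions as EC

    elems = WebDriverWait(driver, timeout).until(
        EC.visibility_of_all_elements_located((
            By.CSS_SELECTOR,
            "div[data-testid=tracklist-row] div[dir=auto][data-encore-id=type]")))
    return nonempty_names(e.text for e in elems)


def main():
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get("https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF")
        # Dismiss the cookie banner first (this selector may change over time)
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((
            By.CSS_SELECTOR,
            "div#onetrust-close-btn-container > button[aria-label='Close']"))).click()
        print(visible_track_names(driver))
    finally:
        driver.quit()
```

Call `main()` to run it against a live browser; it only prints the rows that are currently rendered (see the comments below about scrolling).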
    


undetected Selenium

I'm not sure what your expected output is, but it seems like you want to access the text in all of the span tags with data-encore-id="type". The following code does this.

Please note that for the code to run you will need to install the selenium module and a browser driver. You can reference the selenium documentation for steps on how to set it up. You can also use this video as a reference for installing the chromedriver on Windows 10.

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
        
# instantiate a Chrome object       
driver = webdriver.Chrome()

url = "https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF"

# send GET request via Chrome driver to execute all of the JS
driver.get(url) 

# pause so that the website has enough time to load
sleep(4)

# get all of the current page's html
page_html = driver.page_source

page_source = BeautifulSoup(page_html, "html.parser")
song_tables = page_source.find_all("span", attrs={"data-encore-id": "type"})
for song in song_tables:
    print(song.text)

Übermensch
  • I am trying to extract the song data from this page. It is extracting the songs, but not everything on the page; I was only able to get 23 songs. – Santhiya s Mar 09 '23 at 17:03
  • How can I extract everything? – Santhiya s Mar 09 '23 at 17:03
  • I didn't realize not all of the songs were being scraped. The reason they are not all being scraped is that the rest of the page's HTML is loaded when you scroll down the page; that is, there is JS that executes when the user scrolls down. I tried using some old code to interact with the scroll bar, but it does not work and I don't know why. So your new issue is now "how to scroll down a webpage using selenium and python". – Übermensch Mar 09 '23 at 17:54
  • https://stackoverflow.com/a/75691865/4827471 - here is the answer – Santhiya s Mar 20 '23 at 15:29
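For later readers, the scroll-to-load approach the comments converge on can be sketched roughly like this (an untested sketch: Spotify virtualizes the tracklist, so rows are swapped in and out as you scroll; the selector is taken from the answers above, and the scroll target and round limits are assumptions):

```python
import time


def accumulate(ordered, seen, new_names):
    """Append names not seen before, preserving first-seen order."""
    for name in new_names:
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


def scroll_and_collect(driver, pause=1.0, max_rounds=40):
    """Repeatedly scroll the tracklist and accumulate every row's song name."""
    from selenium.webdriver.common.by import By  # third-party: pip install selenium

    ordered, seen = [], set()
    last_count = -1
    for _ in range(max_rounds):
        rows = driver.find_elements(
            By.CSS_SELECTOR,
            "div[data-testid=tracklist-row] div[dir=auto][data-encore-id=type]")
        accumulate(ordered, seen, [r.text for r in rows if r.text])
        if len(ordered) == last_count:  # nothing new appeared; assume we hit the end
            break
        last_count = len(ordered)
        if rows:
            # Scroll the last visible row into view so the list renders more rows
            # (works even when the list lives in an inner scroll container).
            driver.execute_script("arguments[0].scrollIntoView();", rows[-1])
        time.sleep(pause)
    return ordered
```

Because virtualized rows disappear once they scroll out of view, collecting into a seen-set after every scroll step is what preserves the full list.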