Unable to extract specific text from XPaths in python for online URLs

Question

I want to extract specific text at specific XPaths from this URL:
https://www.discogs.com/it/artist/148415-Total-Eclipse-4

but nothing is displayed in the console. I expect the following output instead:

Artista 1: Total Eclipse (4)
Testo elemento 1: Jungle Fever
Testo elemento 2: Come Together

My script works perfectly for local .html files, but it doesn't seem to work when I try to use URLs. Here's the script I'm using:

import requests
from lxml import html

# Ask the user to enter the Discogs URL
url = input("Enter the Discogs URL: ")

# Make an HTTP request to get the HTML content of the page
response = requests.get(url)
html_content = response.text

# Parse HTML using lxml
tree = html.fromstring(html_content)

# Use a generic XPath to capture as many "h1" and "a" elements as desired
elements = tree.xpath('//*[starts-with(local-name(), "div")][4]/div/div[1]/div[1]/div/h1 | //*[starts-with(local-name(), "div")][4]/div/div[2]/div[2]/div/table//*[starts-with(local-name(), "tr")][2]/td[5]/a | //*[starts-with(local-name(), "div")][4]/div/div[2]/div[2]/div/table//*[starts-with(local-name(), "tr")]/td[5]/a')

# Counters for the artist ("h1") elements and the text ("a") elements
artist_counter = 1
text_counter = 1

# Print the text of the "h1" and "a" elements found in sequence
for element in elements:
    if element.tag == 'h1':
        print(f"Artista {artist_counter}: {element.text.strip()}")
        artist_counter += 1
    else:
        print(f"Testo elemento {text_counter}: {element.text.strip()}")
        text_counter += 1

That web page is probably populated using javascript after the page loads, which means you'll never see the data you want when using the `requests` module. You would either need to use an actual browser (e.g. via selenium or playwright), or you would need to examine the javascript calls and see if there's a URL you can retrieve that returns the actual data. — larsks, Jul 31 '23 at 00:36

Ajeet Verma · Accepted Answer · 2023-08-01T01:29:35.993

It's a dynamic webpage: different pieces of information are loaded with JavaScript after the initial response. A plain requests call will not work here, because it only returns the static page content and never runs the page's scripts.
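One way to confirm this before reaching for a browser is to count how many nodes an XPath selects in the raw HTML that requests receives. The sketch below is only an illustration: `count_matches` is a hypothetical helper, and the two HTML strings are made-up stand-ins (not the real Discogs markup) for a page before and after its scripts have run.

```python
from lxml import html


def count_matches(html_text: str, xpath: str) -> int:
    """Return how many nodes the XPath selects in the given HTML text."""
    tree = html.fromstring(html_text)
    return len(tree.xpath(xpath))


# Hypothetical stand-ins, not the real Discogs markup:
# what a server might send before any JavaScript runs ...
static_html = (
    '<html><body><div id="app"></div>'
    '<script src="bundle.js"></script></body></html>'
)
# ... and what the DOM might look like after the scripts have filled it in.
rendered_html = (
    '<html><body><div id="app">'
    '<h1 class="title_abc12">Total Eclipse (4)</h1>'
    '</div></body></html>'
)

print(count_matches(static_html, "//h1"))    # 0
print(count_matches(rendered_html, "//h1"))  # 1
```

Running `count_matches(response.text, "//h1")` on the real response works the same way. A count of 0 means the element is not in the HTML that requests downloaded, so no XPath can find it there.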

Here's how you can accomplish your task using Selenium:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = Chrome()
wait = WebDriverWait(driver, 10)

url = "https://www.discogs.com/it/artist/148415-Total-Eclipse-4"
driver.get(url)

title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'h1[class^="title_"]'))).text
print(f"Artista 1: {title}")

container = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div[class^="textWithCovers_"]>div')))

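for i, element in enumerate(container, start=1):
    # Each row's title cell holds a link whose text is the release title
    release_title = element.find_element(By.CSS_SELECTOR, 'td[class^="title_"]> a').text
    print(f"Testo elemento {i}: {release_title}")

# Close the browser once scraping is done
driver.quit()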

Output:

Artista 1: Total Eclipse (4)
Testo elemento 1: Jungle Fever
Testo elemento 2: Come Together

Notice that the elements are located by partial class name (the `^=` prefix match in the CSS selectors), because the class names end in a few dynamically generated characters, for example class="textWithCovers_2o9C3".
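Since the question used lxml, here is a rough sketch of the same partial-class idea in XPath. The markup and class suffixes are invented for illustration. `starts-with(@class, ...)` behaves like the CSS `[class^="..."]` selector and `contains(@class, ...)` behaves like `[class*="..."]`.

```python
from lxml import html

# Invented markup; the suffixes after "title_" stand in for generated characters.
doc = html.fromstring(
    '<div>'
    '<h1 class="title_3Xy9z">Total Eclipse (4)</h1>'
    '<p class="note title_9Kq2">prefix is on the second class</p>'
    '<p class="cover_1a2b">unrelated</p>'
    '</div>'
)

# Like CSS [class^="title_"]: the whole class attribute must start with the prefix.
prefix_hits = doc.xpath('//*[starts-with(@class, "title_")]')

# Like CSS [class*="title_"]: the text may appear anywhere in the class attribute.
anywhere_hits = doc.xpath('//*[contains(@class, "title_")]')

print([el.tag for el in prefix_hits])    # ['h1']
print([el.tag for el in anywhere_hits])  # ['h1', 'p']
```

The prefix form tests the whole attribute value, so it misses an element whose matching class is not listed first. The contains form catches that case but can also match unrelated names that merely include the same text.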

Just one question, I tried to search for the `"textWithCovers_"` class but I didn't find that class, how did you find it on that URL? With DevTools I can't find it — Peter Long, Jul 31 '23 at 15:11
Please check the answer, I've updated it with the explanation to answer your question. — Ajeet Verma, Aug 01 '23 at 01:30
Thank you for your tips, but they didn't work for my other script. I used Selenium as you did, but I still got it wrong. Could you help me with this, please? https://superuser.com/questions/1801841/nosuchelementexception-while-scraping-data-from-discogs-url-using-selenium — Peter Long, Aug 02 '23 at 12:36
I've answered your 2nd question there https://superuser.com/questions/1801841/nosuchelementexception-while-scraping-data-from-discogs-url-using-selenium/1801975#1801975 — Ajeet Verma, Aug 02 '23 at 12:53
