I'm trying to get information from this URL:

https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event

I want to extract the text "Hot 8 Brass Band are a Grammy-nominated New Orleans based brass band, whose sound... " etc.

My approach: I want to extract the info without relying on the explicit div name, since that tends to change. So I identify the "About Hot 8 Brass Band" heading using a variable, and then I want to reach the bio through its following siblings and their child divs.

Code:

url = "https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event"

driver.get(url)


#Get artist
try:
    artist = driver.find_elements_by_css_selector('a[href^="https://www.bandsintown.com/a/"] h1')
    artist = artist[0].text
    print(artist)
except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    print ("artist doesn't exist")



#Get Bio Info
try:
    readMoreBio = driver.find_element_by_xpath("//div[text()='Read More']").click()
    print("Read More Bio Clicked")
except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    pass



#Once read more clicked, get full bio info
try:
    artistBioDiv = driver.find_elements_by_xpath("(//div[text()='About " + artist + "'])[0]/following-sibling/following-sibling::div")
    print("artistBioDiv is: ", artistBioDiv)

except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    print ("artist bio doesn't exist")

This just seems to access an empty element, i.e. it's not finding the bio paragraph.

Here's the HTML structure:

[screenshot of the page's HTML structure, not reproduced]

halfer
DiamondJoe12

2 Answers

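I think the problem is with the XPath you used to find the bio. XPath positions are 1-based, so the [0] predicate never matches anything, and following-sibling/following-sibling::div is missing the :: and node test after the first axis name, so that step is read as a child element literally named following-sibling. Either one makes find_elements return an empty list.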

A few things you could consider for your future projects:

  • Use driver.find_element(By.CSS_SELECTOR, 'CSS_SELECTOR_GOES_HERE') or driver.find_element(By.XPATH, 'XPATH_GOES_HERE'), since find_elements_by_xpath and find_elements_by_css_selector are deprecated in Selenium 4 and removed in later 4.x releases
  • Use WebDriverWait so that elements have enough time to load before you access them
  • Use normalize-space() when matching text in XPath, since it ignores leading and trailing whitespace
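To make the second point concrete, here is a minimal sketch of the polling idea behind an explicit wait. It is not Selenium's implementation, only an illustration; wait_until and its parameters are hypothetical names introduced here.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper: it only illustrates the retry loop that an
    explicit wait performs; it is not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The "element" showed up, so hand it back to the caller.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        # Not there yet: pause briefly, then check again.
        time.sleep(poll)
```

With Selenium, the condition would typically be one of the expected_conditions objects passed to WebDriverWait(driver, 20).until(...), which is what the code below does.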

This code should work for you:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
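from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


options = Options()
options.add_argument("--disable-notifications")

# executable_path is deprecated in Selenium 4; pass the driver path through a Service object instead
driver = webdriver.Chrome(service=Service('D://chromedriver/100/chromedriver.exe'), options=options)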
wait = WebDriverWait(driver, 20)

url = "https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event"

driver.get(url)

try:
    # with xpath
    # artist = wait.until(EC.presence_of_element_located((By.XPATH, '//h1[contains(@href, "https://www.bandsintown.com/a")]'))).text
    artist = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'h1[href^="https://www.bandsintown.com/a/"]'))).text
    
    # read more
    wait.until(EC.presence_of_element_located((By.XPATH, '//div[normalize-space()="Read More"]'))).click()
    
    # bio
    bio = wait.until(EC.presence_of_element_located((By.XPATH, f'//div[normalize-space()="About {artist}"]/following-sibling::div/div[2]/div'))).text
    print(f'Artist: {artist}\nBio:\n{bio}')
except Exception as ex:
    print(f"Error: {ex}")
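
One caveat when the artist name is interpolated into the XPath: a name that contains a quote character would break the expression, because XPath 1.0 string literals have no escape syntax. The following is a minimal sketch of one way to build the heading XPath safely; xpath_literal and about_heading_xpath are hypothetical helper names, not part of Selenium.

```python
def xpath_literal(value: str) -> str:
    """Return `value` as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types are present: split on single quotes and glue the
    # pieces back together with concat(), quoting each "'" with double quotes.
    pieces = []
    for index, part in enumerate(value.split("'")):
        if index > 0:
            pieces.append('"\'"')
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


def about_heading_xpath(artist: str) -> str:
    """Build the XPath for the 'About <artist>' heading div (hypothetical helper)."""
    return f"//div[normalize-space()={xpath_literal('About ' + artist)}]"
```

The result can be passed to By.XPATH in place of the f-string above, for example about_heading_xpath(artist) + "/following-sibling::div/div[2]/div".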
Shine J

To extract the text Hot 8 Brass Band are a Grammy-nominated New Orleans based brass band, whose sound... you can use the following locator strategy:

  • Using XPath and the text attribute:

    driver.get("https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event")
    print(driver.find_element(By.XPATH, "//div[@id='main']//div[text()='About Hot 8 Brass Band']//following-sibling::div[1]//div/div[contains(., 'Hot 8 Brass Band')]").text)
    

Ideally, you should use WebDriverWait with visibility_of_element_located() before reading the element, for example:

  • Using XPath and get_attribute("innerHTML"):

    driver.get("https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='main']//div[text()='About Hot 8 Brass Band']//following-sibling::div[1]//div/div[contains(., 'Hot 8 Brass Band')]"))).get_attribute("innerHTML"))
    
  • Console Output:

    Hot 8 Brass Band are a Grammy-nominated New Orleans based brass band, whose sound draws on the traditional jazz heritage of New Orleans, alongside more modern styles incl...
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in "How to retrieve the text of a WebElement using Selenium - Python".
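
Keep in mind that get_attribute("innerHTML") returns markup, not plain text, so if the bio contains inline tags or entities you may want to strip them. Below is a minimal standard-library sketch; inner_html_to_text is a hypothetical helper name, and turning <br> into a newline is an assumption about how you want the text laid out.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and turns <br> tags into newlines."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def inner_html_to_text(markup: str) -> str:
    """Return the text content of an innerHTML string (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()
```

If you only need the visible text, reading .text from the element avoids this step entirely.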

undetected Selenium