
I'm trying to scrape the titles from a website, but it is only returning one title. How can I get all the titles?

Below is one of the elements I'm trying to fetch using XPath (starts-with):

<div id="post-4550574" class="post-box    " data-permalink="https://hypebeast.com/2019/4/undercover-nike-sfb-mountain-sneaker-release-info" data-title="The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date"><div class="post-box-image-container fixed-ratio-3-2">

This is my current code:

from selenium import webdriver
import requests
from bs4 import BeautifulSoup as bs

driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
driver.get('https://hypebeast.com/search?s=nike+undercover')

element = driver.find_element_by_xpath(".//*[starts-with(@id, 'post-')]")
print(element.get_attribute('data-title'))

Output: The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date

I was expecting a lot more titles, but it is only returning one result.

Hachi

4 Answers


To extract the product titles from the website, since the desired elements are JavaScript-enabled, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies (the required imports are listed after the snippets):

  • XPATH:

    driver.get('https://hypebeast.com/search?s=nike+undercover')
    print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2/span")))])
    
  • CSS_SELECTOR:

    driver.get('https://hypebeast.com/search?s=nike+undercover')
    print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h2>span")))])
    
  • Console Output:

    ['The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date', 'The UNDERCOVER x Nike SFB Mountain Surfaces in "Dark Obsidian/University Red"', 'A First Look at UNDERCOVER’s Nike SFB Mountain Collaboration', "Here's Where to Buy the UNDERCOVER x Gyakusou Nike Running Models", 'Take Another Look at the Upcoming UNDERCOVER x Nike Daybreak', "Take an Official Look at GYAKUSOU's SS19 Footwear and Apparel Range", 'UNDERCOVER x Nike Daybreak Expected to Hit Shelves This Summer', "The 10 Best Sneakers From Paris Fashion Week's FW19 Runways", "UNDERCOVER FW19 Debuts 'A Clockwork Orange' Theme, Nike & Valentino Collabs", 'These Are the Best Sneakers of 2018']
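
Note: the two snippets above additionally rely on these imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC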
    
undetected Selenium
  • Thank you Debanjan for providing the answer. Yours was the only one that worked for me. For anyone visiting this thread: I tried using `find_elements`, but it failed by throwing the error `no attribute was found`. I'm still a newbie in this area, so I'd need to find out why `elements` would not work in this particular case. @DebanjanB, would you know the answer to that? – Hachi Apr 12 '19 at 00:48
  • I have updated the wording of my answer. As the product titles from the website are [JavaScript](https://www.javascript.com/) enabled elements, you have to induce _WebDriverWait_ for `visibility_of_all_elements_located()`, as `find_elements` alone will return **0** elements. – undetected Selenium Apr 12 '19 at 09:43
  • @DebanjanB Thank you for providing further details; it certainly helps me learn more and prevents duplicate questions. – Hachi Apr 13 '19 at 06:55

You don't need selenium. You can use requests, which is faster, and target the data-title attribute:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://hypebeast.com/search?s=nike+undercover')
soup = bs(r.content, 'lxml')
# select every element that carries a data-title attribute
titles = [item['data-title'] for item in soup.select('[data-title]')]
print(titles)

If you do want selenium, the matching syntax is:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://hypebeast.com/search?s=nike+undercover')
titles = [item.get_attribute('data-title') for item in driver.find_elements_by_css_selector('[data-title]')]
print(titles)   
QHarr
  • What didn't work with this answer? Did you get an error message? I tried both successfully. – QHarr Apr 12 '19 at 05:13

If a locator matches multiple elements, then find_element returns only the first one. find_elements returns a list of all the elements found by the locator, and you can iterate over that list to get every element.

If all of the elements you are trying to find have the class post-box, then you could find them by class name.
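
A minimal sketch of that approach, assuming the `post-box` class and search URL from the question:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://hypebeast.com/search?s=nike+undercover')

# find_elements (plural) returns a list of every matching element
posts = driver.find_elements_by_class_name('post-box')

# iterate over the list and read each post's data-title attribute
for post in posts:
    print(post.get_attribute('data-title'))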

S Ahmed

Just sharing my experience and what I've used; it might help someone. Just use:

element.get_attribute('ATTRIBUTE-NAME')
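
For instance, a rough sketch that reads the attribute from every matched element, reusing the page and `data-title` attribute from the question:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://hypebeast.com/search?s=nike+undercover')

# get_attribute can be called on each element returned by find_elements
for element in driver.find_elements_by_css_selector('[data-title]'):
    print(element.get_attribute('data-title'))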
noman404