How to scrape the href attributes of the top 10 clips from https://www.twitch.tv/directory/game/Overwatch/clips?range=7d using Selenium and Python

Question

I keep running into the same problem when web scraping: I get an empty result instead of the elements I expect from inspecting the page's HTML.

My specific goal is to get the link for the top 10 clips from https://www.twitch.tv/directory/game/Overwatch/clips?range=7d.

Here is my code:

# Gathers links of clips to download later

import bs4
import requests
from selenium import webdriver
from pprint import pprint
import time
from selenium.webdriver.common.keys import Keys


# Get links of multiple clips by webscraping main_url

main_url = 'https://www.twitch.tv/directory/game/Overwatch/clips?range=7d'
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(10)
elements_found = driver.find_elements_by_class_name("tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit")
print(elements_found)

driver.quit()

I picked the class name from the element shown in the browser's inspector.

The page uses JavaScript, which is why I am using Selenium rather than the Requests module (which I tried without success).

I added the time.sleep(10) so that I have time to scroll through the page and trigger the JavaScript, to no avail.

I've also tried changing the user-agent and using XPaths, neither of which has produced different results.

No matter what I do, it seems that the program only looks at the raw html that is found by right-click -> inspect page source.

Any help and pointers would be greatly appreciated, I feel thoroughly stuck on this problem. I have been having these issues in all projects of "Chapter 11: Webscraping" from Automate the Boring Stuff, and my personal projects.

Guy · Answer 1 · 2019-12-24T05:14:42.353

0

find_elements_by_class_name accepts only a single class name as its parameter, so with four classes elements_found is an empty list. A single class works, for example

driver.find_elements_by_class_name('tw-interactive')

You are using 4 classes. To match all of them, use a CSS selector

elements_found = driver.find_elements_by_css_selector('.tw-interactive.tw-link.tw-link--hover-underline-none.tw-link--inherit')

Or match the exact value of the class attribute explicitly

elements_found = driver.find_elements_by_css_selector('[class="tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit"]')
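An HTML class attribute is a whitespace-separated list of class names, so the dotted selector can be derived mechanically from the attribute value. The following is a minimal sketch of that conversion; the helper name class_attr_to_css_selector and its optional tag parameter are made up for illustration, and it assumes the class names contain no characters that would need escaping in CSS.

```python
def class_attr_to_css_selector(class_attr, tag=""):
    """Build a CSS selector that requires every class in a class attribute.

    Hypothetical helper: assumes plain class names with no characters
    that need CSS escaping.
    """
    class_names = class_attr.split()  # whitespace separates class names
    return tag + "".join("." + name for name in class_names)


selector = class_attr_to_css_selector(
    "tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit"
)
print(selector)
```

Passing tag="a" would restrict the selector to anchor elements only.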

To get the href attributes from the elements, use get_attribute()

for element in elements_found:
    print(element.get_attribute('href'))
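Note that the find_elements_by_* helper methods were deprecated in Selenium 4 and, as far as I know, removed in 4.3, while the generic find_elements(by, value) method exists in both old and new releases. Below is a hedged sketch of the same lookup using that generic method. The function name collect_hrefs is hypothetical, and the strategy is passed as the plain string "css selector", which is the value behind By.CSS_SELECTOR, so the snippet does not need to import anything from Selenium itself.

```python
def collect_hrefs(driver, css_selector):
    """Return the href of every element matching css_selector.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver
    (or any object with a compatible find_elements method).
    """
    # "css selector" is the string value of By.CSS_SELECTOR
    elements = driver.find_elements("css selector", css_selector)
    return [element.get_attribute("href") for element in elements]
```

With a live driver it could be called as collect_hrefs(driver, '.tw-interactive.tw-link.tw-link--hover-underline-none.tw-link--inherit'), after the page has had time to render.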

edited Dec 24 '19 at 05:14

answered Dec 23 '19 at 09:06


Thanks for the quick response! Implementing your changes I get multiple results in the list with the format: [, [NEXT ITEM IN LIST HERE] How would I extract the links from the webpage? This isn't in HTML format so I cannot simply find the "href" value. Also how did you conclude that "tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit" was 4 classes, as opposed to one long class? I greatly appreciate your help! – Niko Raisanen Dec 23 '19 at 23:03
@NikoRaisanen `elements_found` contains `WebElement`s; what you are seeing is their string representation. Use `get_attribute` to get the attribute (see updated answer). I know there are 4 classes from the spaces between the class names: 3 spaces means 4 classes. – Guy Dec 24 '19 at 05:19

score 0 · Answer 2 · answered Dec 26 '19 at 22:10

As per the documentation of selenium.webdriver.common.by implementation:

class selenium.webdriver.common.by.By
    Set of supported locator strategies.

    CLASS_NAME = 'class name'

So with find_elements_by_class_name() you cannot pass multiple class names, i.e. tw-interactive, tw-link, tw-link--hover-underline-none and tw-link--inherit. If you pass multiple classes, you will get the error:

Message: invalid selector: Compound class names not permitted

You can find a detailed discussion in Invalid selector: Compound class names not permitted using find_element_by_class_name with Webdriver and Python

Solution

As an alternative, you can use WebDriverWait with visibility_of_all_elements_located() and either of the following locator strategies:

CSS_SELECTOR:

driver.get('https://www.twitch.tv/directory/game/Overwatch/clips?range=7d')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.tw-interactive.tw-link.tw-link--hover-underline-none.tw-link--inherit")))])

XPATH:

driver.get('https://www.twitch.tv/directory/game/Overwatch/clips?range=7d')
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@class='tw-interactive tw-link tw-link--hover-underline-none tw-link--inherit']")))])

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Console Output:

['https://www.twitch.tv/playoverwatch/clip/EnticingCoyTriangleM4xHeh', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/mteelul', 'https://www.twitch.tv/chipsa/clip/AgitatedGenerousFlyBleedPurple', 'https://www.twitch.tv/chipsa', 'https://www.twitch.tv/stracciateiia', 'https://www.twitch.tv/playoverwatch/clip/StormyNimbleJamKappaClaus', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/zenofymedia', 'https://www.twitch.tv/sleepy/clip/BombasticCautiousEmuBIRB', 'https://www.twitch.tv/sleepy', 'https://www.twitch.tv/vlday', 'https://www.twitch.tv/playoverwatch/clip/FinePlainApeGrammarKing', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/supdos', 'https://www.twitch.tv/playoverwatch/clip/MotionlessHomelyWrenchNononoCat', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/theefisch', 'https://www.twitch.tv/sonicboom83/clip/WanderingInspiringConsoleM4xHeh', 'https://www.twitch.tv/sonicboom83', 'https://www.twitch.tv/vollg1', 'https://www.twitch.tv/chipsa/clip/PunchyStrongPonyStrawBeary', 'https://www.twitch.tv/chipsa', 'https://www.twitch.tv/stracciateiia', 'https://www.twitch.tv/overwatchcontenders/clip/SavoryArtisticMelonEleGiggle', 'https://www.twitch.tv/overwatchcontenders', 'https://www.twitch.tv/asingledrop', 'https://www.twitch.tv/playoverwatch/clip/TubularLuckyLocustOptimizePrime', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/taipan20', 'https://www.twitch.tv/harbleu/clip/StrongStrongSushiDoggo', 'https://www.twitch.tv/harbleu', 'https://www.twitch.tv/aimmoth', 'https://www.twitch.tv/supertf/clip/GrossSmoothDolphinAMPTropPunch', 'https://www.twitch.tv/supertf', 'https://www.twitch.tv/tajin_ow', 'https://www.twitch.tv/playoverwatch/clip/TransparentCaringPoxVoteNay', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/nepptuneow', 'https://www.twitch.tv/space/clip/CharmingPeppyMetalFunRun', 'https://www.twitch.tv/space', 'https://www.twitch.tv/pantangelicious', 
'https://www.twitch.tv/chipsa/clip/MoldyBadBananaRlyTho', 'https://www.twitch.tv/chipsa', 'https://www.twitch.tv/mopedinspector', 'https://www.twitch.tv/kephrii/clip/SoftSullenInternTTours', 'https://www.twitch.tv/kephrii', 'https://www.twitch.tv/kephrii', 'https://www.twitch.tv/valentine_ow/clip/GorgeousSincereMinkBleedPurple', 'https://www.twitch.tv/valentine_ow', 'https://www.twitch.tv/stracciateiia', 'https://www.twitch.tv/playoverwatch/clip/SpotlessTenuousTarsierPraiseIt', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/bluecloud123', 'https://www.twitch.tv/jake_ow/clip/TriumphantOptimisticQuailKAPOW', 'https://www.twitch.tv/jake_ow', 'https://www.twitch.tv/ph33rah', 'https://www.twitch.tv/playoverwatch/clip/DreamyDependableCheeseGOWSkull', 'https://www.twitch.tv/playoverwatch', 'https://www.twitch.tv/carrosive']
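The output above mixes clip URLs with channel URLs, while the question asks only for the top 10 clips. One possible way to narrow it down is sketched here, assuming that clip links are the ones containing "/clip/" (as they appear to be in this output) and that the order on the page reflects the ranking. The function name top_clip_links is hypothetical.

```python
def top_clip_links(hrefs, limit=10):
    """Return up to `limit` unique clip URLs, preserving page order.

    Assumption: a clip URL is any href containing "/clip/".
    """
    seen = set()
    clips = []
    for href in hrefs:
        if len(clips) >= limit:
            break
        # get_attribute() can return None for anchors without an href
        if not href or "/clip/" not in href:
            continue
        if href in seen:
            continue
        seen.add(href)
        clips.append(href)
    return clips


# First few entries of the console output, used as sample input
sample = [
    "https://www.twitch.tv/playoverwatch/clip/EnticingCoyTriangleM4xHeh",
    "https://www.twitch.tv/playoverwatch",
    "https://www.twitch.tv/mteelul",
    "https://www.twitch.tv/chipsa/clip/AgitatedGenerousFlyBleedPurple",
    "https://www.twitch.tv/chipsa",
    "https://www.twitch.tv/stracciateiia",
]
print(top_clip_links(sample))
```

Applied to the full list returned by either locator strategy, this would keep only the first ten clip links and drop the channel links.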
