
I have searched SO extensively, but the answers I found did not solve my problem:

  • What I am trying to scrape:

    • I have a link. The link redirects to a dynamic website.
    • I want to get the number of videos and number of images residing on this link.
    • I want to do it using bs4, Selenium and Python.
  • What problem I am facing:

    • When I open "Inspect element" and press Ctrl+F to search for the video tags, I see the correct number of videos. But when I open "View source" for the same page, I see only 1 video tag.

Furthermore, when I try to scrape the page, I can retrieve only 1 video. I do not know why bs4 does not detect the other video tags. I assume this has something to do with the page being loaded by JavaScript. But even with the Selenium code below, I still cannot get the correct number of videos and images.

This is the code I have tried:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
res = driver.execute_script('return document.documentElement.outerHTML')
soup = BeautifulSoup(res, 'html.parser')

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

c=1
for vidL in soup.find_all("div", {'class': 'play_button_container absolute-center has_played_hide'}):
    print(vidL)
    print(c)
    c+=1
undetected Selenium
Sssssuppp

2 Answers


Since the data is rendered by JavaScript, you need to wait for the elements to be visible before parsing the page with Beautiful Soup.

Code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
# Wait up to 10 seconds until every play-button container is visible.
WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".play_button_container")))
# Capture the HTML only after the wait has succeeded.
res = driver.page_source
soup = BeautifulSoup(res, 'html.parser')
# Print each play-button container followed by its running count.
for c, vidL in enumerate(soup.find_all("div", {'class': 'play_button_container absolute-center has_played_hide'}), start=1):
    print(vidL)
    print(c)

Console output:

<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
1
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
2
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
3
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
4
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
5
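
The gap between "view source" and the inspector comes down to which HTML string gets parsed. As a rough illustration, the sketch below counts tags in two strings using only the standard library's html.parser, so it runs without a browser. Here `static_html` and `rendered_html` are hypothetical stand-ins for the server response and for `driver.page_source` taken after the wait, and `TagCounter` and `count_tags` are illustrative names, not part of bs4 or Selenium.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count <video> and <img> tags plus elements carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.counts = {"video": 0, "img": 0, "class_hits": 0}

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "img"):
            self.counts[tag] += 1
        # Split the class attribute into tokens so that extra classes on the
        # same element do not prevent a match.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.counts["class_hits"] += 1


def count_tags(html, css_class="play_button_container"):
    """Return the tag and class counts found in an HTML string."""
    parser = TagCounter(css_class)
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical markup: what the server sends versus what exists after
# JavaScript has added two more players to the page.
static_html = '<div><video src="a.mp4"></video><img src="a.png"></div>'
rendered_html = static_html + (
    '<div class="play_button_container absolute-center"></div>' * 2
)

print(count_tags(static_html))
print(count_tags(rendered_html))
```

Feeding the same function the raw response and then the post-wait `driver.page_source` is a quick way to see whether the missing elements are added by JavaScript.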
KunduK

To print the number of videos, use WebDriverWait to wait for visibility_of_all_elements_located(), with either of the following locator strategies:

  • Using CSS_SELECTOR:

    driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
    print(len(WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.play_button_container.absolute-center.has_played_hide")))))
    
  • Using XPATH:

    driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
    print(len(WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='play_button_container absolute-center has_played_hide']")))))
    
  • Console Output:

    5
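
The question also asks for the number of images and scrolls the page, which the snippets above do not cover. What follows is a minimal sketch, assuming lazy-loaded content appears as the page is scrolled and that a plain `img` selector is an acceptable definition of an image on this page; neither assumption has been verified against the site. `scroll_to_bottom`, `count_media` and `SELECTORS` are hypothetical names, and the only Selenium call used is `driver.execute_script`, so any WebDriver instance that has already loaded the page can be passed in.

```python
import time

# Assumed selectors: the play-button class comes from the answers above;
# counting every <img> element is an assumption about what "image" means.
SELECTORS = {
    "videos": "div.play_button_container",
    "images": "img",
}


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the page height stops growing; return the final height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height


def count_media(driver, selectors):
    """Count the elements matching each CSS selector in the live DOM."""
    return {
        name: driver.execute_script(
            "return document.querySelectorAll(arguments[0]).length;", css
        )
        for name, css in selectors.items()
    }
```

With a live session, call `driver.get(url)`, wait as shown above, then run `scroll_to_bottom(driver)` followed by `count_media(driver, SELECTORS)`. Counting in the browser avoids parsing a snapshot that was taken before the scroll, which is the ordering problem in the question's code.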
    
undetected Selenium