Unable to scrape right number of videos and images using Selenium in Python

Question

I searched a lot on SO, but most of the answers were not able to solve my problem:

What I am trying to scrape:
- I have a link. The link redirects to a dynamic website.
- I want to get the number of videos and number of images residing on this link.
- I want to do it using bs4, Selenium and Python.
The problem I am facing:
- When I use "inspect element" and press Ctrl+F to search for the video tags, I see the right number of videos. But when I open "view source" for the same page, I see only 1 video tag.

Furthermore, when I try to scrape, I retrieve only 1 video. I do not know why the other video tags are not detected by bs4. I assume this has something to do with JavaScript-rendered pages. But even when I use the code below, with Selenium, I still do not get the correct number of videos and images.

This is the code I have tried:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
# Grab the current DOM as HTML right after the page load
res = driver.execute_script('return document.documentElement.outerHTML')
soup = BeautifulSoup(res, 'html.parser')

# Scroll to the bottom of the page (this runs after the HTML was already captured)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

c=1
for vidL in soup.find_all("div", {'class': 'play_button_container absolute-center has_played_hide'}):
    print(vidL)
    print(c)
    c+=1

the class you are looking for, "play_button_container...", is in a <div>
tag, not — pcalkins, Jan 30 '20 at 20:16
Sorry, that was a typo while copy pasting the code. I have changed it. It does not work with div. — Sssssuppp, Jan 30 '20 at 20:21
Can you try replacing the third line with this `res = driver.page_source`? — Ali, Jan 30 '20 at 20:23

score 2 · Accepted Answer · answered Jan 30 '20 at 21:46

Since the data is rendered by JavaScript, you need to wait for the elements to be visible before passing the page source to Beautiful Soup.

Code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
# Wait up to 10 seconds until the JavaScript-rendered play buttons are visible
WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".play_button_container")))
# Take the page source only now, so it contains the rendered elements
res = driver.page_source
soup = BeautifulSoup(res, 'html.parser')
c = 1
for vidL in soup.find_all("div", {'class': 'play_button_container absolute-center has_played_hide'}):
    print(vidL)
    print(c)
    c += 1

Console output:

<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
1
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
2
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
3
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
4
<div class="play_button_container absolute-center has_played_hide">
<button aria-label="Play video" class="play_button_big play_button_dark radius2px" type="button">
<span aria-hidden="true" class="ksr-icon__play"></span>
Play
</button>
</div>
5
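
The question also asks for the number of images. Once the wait has finished, the same page source can be scanned for media tags. Below is a minimal sketch, assuming the videos are `<video>` tags and the images are `<img>` tags (the page may mark them up differently). It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `TagCounter` and `count_media` are hypothetical helper names, and the `rendered` string is a made-up stand-in for `driver.page_source`.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags by name (hypothetical helper, not part of bs4 or Selenium)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        self.counts[tag] += 1


def count_media(html):
    """Return (number of <video> tags, number of <img> tags) in an HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts["video"], parser.counts["img"]


# Made-up stand-in for driver.page_source taken after the wait has completed.
rendered = """
<div><video src="a.mp4"></video><img src="1.png"><img src="2.png"/></div>
<div><video src="b.mp4"></video></div>
"""
videos, images = count_media(rendered)
print(videos, images)  # prints: 2 2
```

With a real page you would call `count_media(driver.page_source)` after the `WebDriverWait` line. If you prefer to stay with bs4, `len(soup.find_all("video"))` and `len(soup.find_all("img"))` should give the same kind of count.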

undetected Selenium · Answer 2 · 2020-01-30T20:44:05.697

To print the number of videos, you need to use WebDriverWait with visibility_of_all_elements_located(), and you can use either of the following locator strategies:

Using CSS_SELECTOR:

driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
print(len(WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.play_button_container.absolute-center.has_played_hide")))))

Using XPATH:

driver.get("https://www.kickstarter.com/projects/evolutionwear/fast-solar-charging-that-fits-in-your-pocket/?ref=kicktraq")
print(len(WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='play_button_container absolute-center has_played_hide']")))))

Console Output:
```
5
```
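
To see why the wait makes the difference, it may help to think of an explicit wait as a polling loop: the condition is called repeatedly until it returns something truthy or the timeout expires. Below is a simplified sketch of that idea in plain Python. It is not Selenium's implementation: `wait_until` and `find_play_buttons` are hypothetical names, and the "page" is simulated so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() until it returns something truthy, or raise TimeoutError.

    Hypothetical helper that mimics the idea behind WebDriverWait(...).until(...);
    it is not Selenium's actual code.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:  # an empty list is falsy, so the loop keeps polling
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated page: the "elements" only show up on the third poll, the way
# JavaScript-rendered nodes appear some time after driver.get() returns.
polls = {"n": 0}


def find_play_buttons():
    polls["n"] += 1
    return ["button"] * 5 if polls["n"] >= 3 else []


found = wait_until(find_play_buttons, timeout=2.0, poll=0.01)
print(len(found))  # prints: 5
```

In Selenium itself, the wait raises `TimeoutException` when the condition is never met, so on pages that may have no videos at all it is worth wrapping the `until(...)` call in a `try`/`except` and treating the timeout as a count of 0.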
