How to extract the number of comments of a youtube video using Google Chrome Headless and Selenium?

Question

There is an element on every YouTube video page that shows how many comments the video has. It has the following HTML structure:

<yt-formatted-string class="count-text style-scope ytd-comments-header-renderer">xx Comments</yt-formatted-string>

I want to get the text xx Comments with Selenium.

Code 1: with a headed browser

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
proxy = '127.0.0.1:1080'   
options.add_argument('--proxy-server=socks5://' + proxy)
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver,30)
url='https://www.youtube.com/watch?v=N0lxfilGfak'

driver.get(url)
driver.execute_script("return scrollBy(0, 1000);")
comment = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[contains(., 'Comments')]")))
driver.execute_script("arguments[0].scrollIntoView(true);",comment)
print(driver.find_element(By.XPATH, "//h2[@id='count']").text)

With the above Python code, I can get 717 Comments for https://www.youtube.com/watch?v=N0lxfilGfak.

Now I want to get the same number with a headless browser in Selenium.

Code 2: with a headless browser

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
proxy = '127.0.0.1:1080'   
options.add_argument('--proxy-server=socks5://' + proxy)
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--headless")
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver,30)
url='https://www.youtube.com/watch?v=N0lxfilGfak'

driver.get(url)
driver.execute_script("return scrollBy(0, 1000);")
comment = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[contains(., 'Comments')]")))
driver.execute_script("arguments[0].scrollIntoView(true);",comment)
print(driver.find_element(By.XPATH, "//h2[@id='count']").text)

Note: Code 2 has three more lines than Code 1:

options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--headless")

The other lines are the same in Code 1 and Code 2.

Code 2 gets stuck at the `comment` statement and eventually times out:

>>> comment = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[contains(., 'Comments')]")))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

Why can't I get the element with a headless browser in Selenium?
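The TimeoutException above carries an empty message, so it says little about what the headless browser actually rendered. One way to investigate, sketched here as an assumption rather than as part of the original question, is to save the window size, a screenshot, and the page source when the wait fails. The helper name `dump_debug_info` and its parameters are hypothetical; `get_window_size()`, `save_screenshot()`, and `page_source` are standard Selenium WebDriver members.

```python
import os


def dump_debug_info(driver, out_dir, prefix="headless"):
    """Save what the browser currently sees so a timeout can be inspected.

    `driver` is expected to be a Selenium WebDriver (or anything exposing
    get_window_size(), save_screenshot() and page_source).
    """
    os.makedirs(out_dir, exist_ok=True)
    size = driver.get_window_size()  # dict with 'width' and 'height'
    screenshot_path = os.path.join(out_dir, prefix + ".png")
    source_path = os.path.join(out_dir, prefix + ".html")
    driver.save_screenshot(screenshot_path)
    with open(source_path, "w", encoding="utf-8") as handle:
        handle.write(driver.page_source)
    return {
        "window": (size["width"], size["height"]),
        "screenshot": screenshot_path,
        "source": source_path,
    }
```

It could be called from an `except TimeoutException:` branch around the `WebDriverWait(...).until(...)` call. A small reported window, or a screenshot in which the comments section is nowhere in view, would point at viewport size rather than at the locator.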

undetected Selenium · Accepted Answer · 2020-07-12T06:19:19.153

You were almost there. To print the text xx Comments from a Chrome session driven by Selenium and ChromeDriver, you have to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:

Using XPATH and text attribute:

driver.get("https://www.youtube.com/watch?v=N0lxfilGfak")
driver.execute_script("return scrollBy(0, 1000);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH,"//h2[@id='count']/yt-formatted-string"))).text)

Using CSS_SELECTOR and get_attribute():

driver.get("https://www.youtube.com/watch?v=N0lxfilGfak")
driver.execute_script("return scrollBy(0, 1000);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h2#count>yt-formatted-string"))).get_attribute("innerHTML"))

Console Output:
```
717 Comments
```
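Both snippets print the whole label, for example 717 Comments. If the number itself is needed, the label can be parsed afterwards. The following is a minimal sketch, not part of the original answer: the function name `parse_comment_count` is hypothetical, and it assumes an English label with a plain or comma-grouped integer in front. Abbreviated or localized counts are deliberately rejected instead of guessed.

```python
import re


def parse_comment_count(label):
    """Return the leading integer of a label such as '717 Comments'.

    Accepts comma-grouped digits ('1,234 Comments'). Returns None when the
    label does not start with a whole number followed by whitespace or the
    end of the string (for example '1.2K Comments').
    """
    match = re.match(r"\s*(\d[\d,]*)(?=\s|$)", label)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

Usage with the element text from either snippet would look like `parse_comment_count(element.text)`, where `element` is the located `yt-formatted-string`.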

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Using Headless Chrome

With headless Chrome you can use the following solution:

Code Block:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions() 
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')
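# executable_path is the Selenium 3 style; Selenium 4 passes the driver path through a Service object instead.
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')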
driver.get("https://www.youtube.com/watch?v=N0lxfilGfak")
driver.execute_script("return scrollBy(0, 1000);")
subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
# using xpath and text attribute
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH,"//h2[@id='count']/yt-formatted-string"))).text)
# using cssSelector and get_attribute()
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h2#count>yt-formatted-string"))).get_attribute("innerHTML"))
print("Exiting")
driver.quit()

Console Output:
```
717 Comments
717 Comments
Exiting
```
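Compared with the code in the question, the headless block above replaces start-maximized with an explicit --window-size. Without a display the headless window is usually much smaller than a maximized one (800x600 is the commonly reported default, though that is an assumption about the Chrome build in use). A minimal sketch that keeps the two switches in one place is shown below; the helper names `headless_arguments` and `build_headless_options` are hypothetical, and only `ChromeOptions.add_argument()` is Selenium's own API.

```python
def headless_arguments(width=1920, height=1080):
    """Build the Chrome switches for a headless run with a fixed viewport."""
    return ["--headless", "--window-size={},{}".format(width, height)]


def build_headless_options(width=1920, height=1080):
    """Return ChromeOptions carrying the switches from headless_arguments()."""
    # Imported lazily so headless_arguments() stays usable without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_arguments(width, height):
        options.add_argument(argument)
    return options
```

The result would be passed as `webdriver.Chrome(options=build_headless_options())`.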

score 0 · Answer 2 · answered Jul 12 '20 at 07:11

Add this line to the headless options:

options.add_argument('--window-size=1920,1080')

Or scroll further in the y direction:

driver.execute_script("return scrollBy(0, 5000);")

My xpath expression is more direct.
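Picking a single scroll distance such as 5000 is a guess. An alternative, sketched here as an assumption and not taken from this answer, is to scroll in steps and stop as soon as the element exists. The function name `scroll_until_present` and its parameters are hypothetical; `find_elements()` and `execute_script()` are standard WebDriver methods.

```python
import time


def scroll_until_present(driver, by, value, step=1000, max_scrolls=10, pause=0.5):
    """Scroll down in steps until an element matching (by, value) exists.

    Returns the first matching element, or None if nothing matched after
    max_scrolls scroll steps.
    """
    for attempt in range(max_scrolls + 1):
        matches = driver.find_elements(by, value)
        if matches:
            return matches[0]
        if attempt < max_scrolls:
            driver.execute_script("window.scrollBy(0, arguments[0]);", step)
            time.sleep(pause)  # give lazily loaded content time to render
    return None
```

With a real driver the call would be `scroll_until_present(driver, By.XPATH, "//h2[@id='count']")`. The fixed pause is a simplification; combining each step with a short WebDriverWait would be more robust.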

showkey


Ideally, you should use `add_argument('--window-size=1920,1080')` else it would be tougher to guess a good value to `scrollBy()` as the display isn't present. – undetected Selenium Jul 12 '20 at 07:36
