
I have been writing a simple Python script to obtain the number of views and the number of comments for a list of videos. Using csv, I converted a tab-separated table into a list of lists, and then tried to obtain both values. Inspecting the page, the number of views is in the element "div", {"class":"watch-view-count"}. This works as intended:

r = requests.get(list_youtube_reading[n][0]) # it retrieves each video URL from a csv
soup = BeautifulSoup(r.text)
for element in soup.findAll("div", {"class":"watch-view-count"}): 
    patternviews = re.compile('^(.*?) .*') 
    scissorviews = patternviews.match(element.text.encode("utf-8")) 
    views = re.sub(r'\.', '', scissorviews.group(1))  # strip thousands separators

However, the element that holds the number of comments is <h2 class="comment-section-header-renderer" tabindex="0"> <b>Comments</b> " • 6" <span class="alternate-content-link"></span>
</h2>

When I tried to obtain it, with

for element in soup.findAll("h2", {"class":"comment-section-header-renderer"}):
    comments = element.text.encode("utf-8")
    print comments

nothing happens, and in fact soup doesn't contain any <h2 class="comment-section-header-renderer" tabindex="0"> tag.

What can I do to retrieve the number of comments? I tried to use the YouTube Data API v3, but to no avail.

thanks in advance

  • Could you show the code where you tried to obtain the number of comments? – Kevin Collins Jul 07 '17 at 16:50
  • Yes, I added it. However, the problem is that there is no string in soup equal to `<h2 class="comment-section-header-renderer" tabindex="0">`. I'm almost sure that it is not possible to scrape the number of comments from the raw HTML, although I may be wrong. The closest string that I found is `'COMMENTS_TOKEN': "EhYSC3FPVFpLTDZDUFY4wAEAyAEA4AEDGAY%3D",`

    – Juan Luis Chulilla Jul 07 '17 at 16:54
  • I think you'll need to grab that token value and use it in the separate ajax call I posted about. – Kevin Collins Jul 07 '17 at 17:03

2 Answers


One simple way is to use Selenium WebDriver to drive a real web browser. I have observed that YouTube loads the comments section only when we scroll down. So my solution is to make the web driver scroll down and wait until the desired element is found. Once it has been located, the following script grabs it and reads the value.

To use Selenium, we need to download one of the third-party drivers from this page. I used Mozilla GeckoDriver. We also need to put the path to this executable in the system PATH environment variable. As I am on an Ubuntu machine, I put the downloaded file (after extracting it) in /usr/local/bin/ and didn't need anything more. We also need to install Selenium itself; the instructions are here. With the path set properly, we can run the script below to get the desired values.
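Before running the script, it may help to confirm that the driver executable is actually visible on the PATH. This is only a minimal sketch written for Python 3 (the script in this answer is Python 2); `find_driver` is a hypothetical helper name, and `geckodriver` is assumed to be the executable's name on your system.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        # Selenium may fail to start Firefox if it cannot locate the driver.
        print("geckodriver was not found on PATH")
    else:
        print("geckodriver found at " + path)
```

If this prints that the driver was not found, re-check where the extracted file was placed (for example /usr/local/bin/ on Ubuntu).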

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

video_url = 'https://www.youtube.com/watch?v=NP189MPfR7Q'
driver = webdriver.Firefox()
driver.set_page_load_timeout(30)
driver.get(video_url)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
for view_num in driver.find_elements_by_class_name("watch-view-count"):
    print 'Number of views: ' + view_num.text.replace(' views', '')

try:
    element = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "comment-section-header-renderer")))
    for comment_num in driver.find_elements_by_class_name("comment-section-header-renderer"):
        print u'Number of comments: ' + comment_num.text.replace(u'COMMENTS • ', '')
finally:
    driver.quit()

Output:

Number of views: 3,555
Number of comments: 3

NOTE: Since the DOM element that contains the comment count includes a non-ASCII character (the bullet), I needed the encoding declaration on the very first line of the script.
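As an alternative to `replace(u'COMMENTS • ', '')`, the count can be pulled out with a regular expression that does not depend on the capitalisation of "Comments". This is a rough sketch, not part of the script above; `parse_comment_count` is a hypothetical helper, and it assumes the header text has the form seen here (a bullet followed by a number that may contain thousands separators).

```python
# -*- coding: utf-8 -*-
import re


def parse_comment_count(header_text):
    """Return the integer that follows the bullet in a header such as
    u'COMMENTS \u2022 3', or None if no such number is present."""
    # \u2022 is the bullet; the group captures digits plus '.' or ',' separators.
    match = re.search(u"\u2022\\s*(\\d[\\d.,]*)", header_text)
    if match is None:
        return None
    digits = re.sub(u"[.,]", u"", match.group(1))
    return int(digits)


if __name__ == "__main__":
    print(parse_comment_count(u"COMMENTS \u2022 3"))
```

Writing the bullet as the escape `\u2022` keeps the source file itself pure ASCII, so this helper does not rely on the encoding declaration.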

If you don't want Selenium to show the browser GUI, follow these instructions. I did not try this myself, but the instructions should be enough.

arif
  • If the answer has solved your problem, please consider 'accepting' it to mark your question as solved. – arif Jul 07 '17 at 22:39
  • Hey Arif, thanks a LOT!!! I didn't know Selenium and it works like a charm. I'm on a Windows machine, but installing both it and geckodriver has been seamless. It only half works on Windows, since I got a UnicodeEncodeError with the damn bullet and the Windows console: 'UnicodeEncodeError: 'charmap' codec can't encode character u'\u2022' in position 32: character maps to <undefined>'. I'm going to try it ASAP on a Linux machine, because it is going to be very useful for different tasks. Thanks again!! – Juan Luis Chulilla Jul 07 '17 at 22:52
  • I'm tempted to mark it as solved, although I would like to comment on whether the Unicode problem can be solved in the end – Juan Luis Chulilla Jul 07 '17 at 22:53
  • Working with Unicode strings is a different issue. If we don't manipulate that bullet or the contents of those DOM elements, this answer will solve the original problem of getting the 'number of comments'. Besides, I would recommend that you use the Linux machine if you can. Since I am not on Windows, I might be of little help. – arif Jul 07 '17 at 23:14
  • @JuanLuisChulilla, did you get the same Unicode problem in your Linux environment too? Or did you solve the problem? – arif Jul 08 '17 at 00:45
  • If you still have the problem, try [this post](https://stackoverflow.com/questions/32382686/unicodeencodeerror-charmap-codec-cant-encode-character-u2010-character-m). People there had a similar encoding problem, and it had to do with the console, not with Python. Let me know whether you get the same problem when you try it in a Linux environment. – arif Jul 08 '17 at 00:51
  • Thanks again. The best result I obtained was by changing the console encoding in the Windows cmd window with 'chcp 1252'. I'll report the Linux result tomorrow – Juan Luis Chulilla Jul 10 '17 at 22:01
  • OK, sorry for the delay. On Linux the script works seamlessly. No problem with encoding, as expected. The more I try Windows for this kind of work, the more I realize that Linux is best for it. THANKS arif!!! – Juan Luis Chulilla Jul 11 '17 at 09:28

It appears the comments section is loaded in a separate ajax request to a URL like this:

https://www.youtube.com/watch_fragments2_ajax?v=zlYDDLCorNw&tr=time&distiller=1&ctoken=EhYSC3psWURETENvck53wAEAyAEA4AEBGAY%253D&frags=comments&spf=load
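A rough sketch of how such a URL might be assembled, written for Python 3. It assumes the `ctoken` parameter is the `COMMENTS_TOKEN` value found in the watch page source, percent-encoded once more (which is consistent with `%3D` in the page becoming `%253D` in the URL); that mapping is an inference, not something confirmed. `extract_comments_token` and `build_comments_url` are hypothetical helper names. This only builds the URL: as discussed in the comments below this answer, the actual request is a POST and probably needs additional headers.

```python
import re
from urllib.parse import urlencode


def extract_comments_token(page_source):
    """Return the value of 'COMMENTS_TOKEN' from the watch page source, or None."""
    match = re.search(r'\'COMMENTS_TOKEN\':\s*"([^"]+)"', page_source)
    return match.group(1) if match else None


def build_comments_url(video_id, token):
    """Build the comments fragment URL (assumed parameter set, copied from the example URL)."""
    params = [
        ("v", video_id),
        ("tr", "time"),
        ("distiller", "1"),
        ("ctoken", token),  # urlencode escapes '%' as '%25'
        ("frags", "comments"),
        ("spf", "load"),
    ]
    return "https://www.youtube.com/watch_fragments2_ajax?" + urlencode(params)


if __name__ == "__main__":
    print(build_comments_url("zlYDDLCorNw", "EhYSC3psWURETENvck53wAEAyAEA4AEBGAY%3D"))
```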

That returns some json like this:

{
  "name": "other",
  "foot": "<script>...</script>",
  "body": {
    "watch-discussion": " ... <h2 class=\"comment-section-header-renderer\" tabindex=\"0\">\n<b>Comments</b> • 2<span class=\"alternate-content-link\"> ..."
  }
}

In that json is where you'll find the HTML section showing the comment count (in body.watch-discussion).
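Once you have that JSON text, extracting the count could look roughly like this. It is a sketch for Python 3 that assumes the response has the shape shown above; `comment_count_from_fragment` is a hypothetical helper, and the regular expression assumes the header keeps the `<b>Comments</b> • N` form.

```python
import json
import re


def comment_count_from_fragment(json_text):
    """Return the comment count found in body['watch-discussion'], or None."""
    data = json.loads(json_text)
    html = data.get("body", {}).get("watch-discussion", "")
    # Look for the header text: <b>Comments</b>, a bullet, then the number.
    match = re.search("<b>Comments</b>\\s*\u2022\\s*(\\d[\\d.,]*)", html)
    if match is None:
        return None
    return int(re.sub("[.,]", "", match.group(1)))


if __name__ == "__main__":
    sample = json.dumps({
        "name": "other",
        "body": {
            "watch-discussion": '<h2 class="comment-section-header-renderer" '
                                'tabindex="0">\n<b>Comments</b> \u2022 2<span>'
        },
    })
    print(comment_count_from_fragment(sample))
```

A response without the expected body (such as `{"reload":"now"}`) makes the helper return None rather than raise.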

Kevin Collins
  • Thanks, Kevin. My knowledge level is not very high. When I try to access that URL in the browser, it only returns `{"reload":"now"}`. Should a BeautifulSoup instance created from the response contain the JSON you showed? Thanks in advance!!! – Juan Luis Chulilla Jul 07 '17 at 18:16
  • `r = requests.get("https://www.youtube.com/watch_fragments2_ajax?v=zlYDDLCorNw&tr=time&distiller=1&ctoken=EhYSC3psWURETENvck53wAEAyAEA4AEBGAY%253D&frags=comments&spf=load")` `soup = BeautifulSoup(r.text)` returns nothing :( – Juan Luis Chulilla Jul 07 '17 at 18:21
  • Open up Chrome dev tools and watch the network requests when you load a video. You'll see this request for the comments is a POST, whereas you're using GET. There are probably other headers you'll need to add as well. It's helpful to use a tool such as http://www.telerik.com/fiddler to see all the details about the request that you'll need to emulate. Also with that tool you can copy requests and rerun them until you figure out exactly what's needed. – Kevin Collins Jul 07 '17 at 18:34
  • Thanks, Kevin. You have opened new paths for me. I have to investigate and work through them, but I'm pretty sure that they are going to be fruitful. Right now I don't have enough skill to integrate your suggestion into my script, but that's a matter of studying :) – Juan Luis Chulilla Jul 07 '17 at 22:31
  • Anyway, would you mind adding an example to your answer? I'm sure it would be appreciated by other users besides me – Juan Luis Chulilla Jul 07 '17 at 23:10