
I have a function that grabs the source code of YouTube clips and then tries to find: startTimeMs, endTimeMs, and videoId.

This is the videoId block:

import requests

class className():
    def __init__(self, link):
        # make the request
        self.r = requests.get(link)

    def originalVideoID(self):
        # get the source code
        source = str(self.r.content)

        # these are the endpoints in which the videoID is enclosed
        start = "\"videoDetails\":{\"videoId\":\""
        end = '\"'

        # gets everything right of videoDetails
        videoID = source.split(start)[1]

        # gets everything left of the quote
        videoID = videoID.split(end)[0]

        return videoID
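
A defensive variant of the same split logic (a sketch; `extract_video_id` is a hypothetical standalone helper, not part of the class above) checks for the marker before indexing, so a page served without the player JSON yields None instead of an IndexError:

```python
def extract_video_id(source):
    """Return the clip's original videoId, or None if the marker is absent."""
    start = '"videoDetails":{"videoId":"'
    end = '"'
    if start not in source:
        # the page was served without the expected player JSON
        return None
    return source.split(start)[1].split(end)[0]

# example with stub page sources
page = 'prefix "videoDetails":{"videoId":"NiXD4xVJM5Y","title":"x"} suffix'
print(extract_video_id(page))                         # NiXD4xVJM5Y
print(extract_video_id("<html>consent page</html>"))  # None
```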

Expected Outcome:

If given a YouTube Clip URL like: https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs,

videoID should consistently be NiXD4xVJM5Y.

Actual Outcome:

  • Sometimes, the expected outcome occurs.
  • Other times, the `videoID = source.split(start)[1]` line throws an IndexError.

When debugging this:

  • I added a check for `start in source` just before the split; it returns False whenever the IndexError is thrown.
  • I printed `str(self.r.content)` and can see that, in those cases, the source code is completely different.

What am I doing wrong? Is this a case for another tool like Selenium, am I using requests wrong, or should I approach this differently?

EDIT: This is the traceback of the error:

Traceback (most recent call last):
  File "PATHTOPROJECT\FILENAME.py", line 383, in <module>
    main()
  File "PATHTOPROJECT\FILENAME.py", line 165, in download_video
    downloadLink = className(link).originalVideoID()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATHTOPROJECT\FILENAME.py", line 67, in originalVideoID
    videoID = source.split(start)[1]
              ~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

The data that I am seeking in the source code is within this script:

<script nonce="601b9hyYx1NEaPf0pQewqA">
    var ytInitialPlayerResponse = 
    {
        ...
        "videoDetails":
        {
            "videoId":"NiXD4xVJM5Y", ...
        },
        ...
        "clipConfig":
        {
            "postId": ... ,"startTimeMs":"0","endTimeMs":"15000"
        }
    }
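
Since `ytInitialPlayerResponse` is a JSON object, an alternative to raw string splitting (as suggested in the comments) is to capture the whole object with a regex and parse it with `json.loads`. This is a sketch using only the stdlib; `sample_source` below is a stand-in for the real page source, and the non-greedy regex assumes the JSON contains no literal `};` inside a string:

```python
import json
import re

# stand-in for str(requests.get(link).content)
sample_source = (
    '<script nonce="x">var ytInitialPlayerResponse = '
    '{"videoDetails":{"videoId":"NiXD4xVJM5Y"},'
    '"clipConfig":{"postId":"p","startTimeMs":"0","endTimeMs":"15000"}};</script>'
)

# capture the JSON object assigned to ytInitialPlayerResponse
match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.*?\});", sample_source, re.DOTALL)
player = json.loads(match.group(1))

print(player["videoDetails"]["videoId"])    # NiXD4xVJM5Y
print(player["clipConfig"]["startTimeMs"])  # 0
```

Note this only changes how the result is interpreted; as the comments point out, it does not change how the page is fetched, so it will still fail when YouTube serves a consent page instead of the player JSON.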
futium
  • Can you give us the full traceback on one of these errors? – SimonUnderwood May 10 '23 at 22:20
  • It's JSON. Why don't you just use `json.loads` to convert it to a Python structure instead of hacking the string contents? – Tim Roberts May 10 '23 at 22:32
  • @TimRoberts would I still use requests.get().content to obtain the source code and then use json.loads on it or would json.loads avoid this? – futium May 10 '23 at 22:41
  • `json.loads` merely helps you interpret the results you get back. It doesn't change the way you fetch the results. – Tim Roberts May 10 '23 at 22:46
  • could you please tell us what `str(self.r.content)` contains when error happens? Also, could you check what `self.r.status_code` value is when error happens? – Muhammad Nizami May 10 '23 at 23:14
  • it also would be nice if you can tell us the list of links on which this error happens. – Muhammad Nizami May 10 '23 at 23:22
  • ...it appears to me you are trying to scrape the youtube website? with the provided URL your code _always_ produces an index out of range exception on my side...and the string you are using to split() doesn't appear at all in what I receive from youtube...I can only assume there is some vital javascript that doesn't get executed which loads the expected content.... – mrxra May 10 '23 at 23:27
  • @mrxra the data I am trying to get is within a script which has a bunch of data. If this is within JavaScript, is there a way to way for this to load first? Or a better way to access this? – futium May 10 '23 at 23:48
  • ...hmm...yes, javascript. youtube requires the client to accept the tracker cookies, then then reloads the page...which requests of course cannot handle. you could use selenium webdriver i guess. – mrxra May 11 '23 at 00:10

1 Answer


Download a chromedriver matching your version of Chrome (https://chromedriver.chromium.org/downloads), unzip the file, and change `path_to_chromedriver` in the following script, which accepts the cookie policy, waits for the page to be fully loaded and THEN parses the page content (your code/split logic):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path_to_chromedriver="chromedriver/chromedriver"
video_url = 'https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs'

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
# Selenium 4 removed executable_path; pass the driver path via a Service object
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(path_to_chromedriver), options=options)

driver.get(video_url)

# find "accept all" button and submit...
b = [b for b in driver.find_elements(by=By.TAG_NAME, value="button") if b.accessible_name and b.accessible_name.lower() == 'accept all'][0]
b.submit()

# https://stackoverflow.com/a/26567563/12693728: wait for page to be loaded. retrieving video id sometimes fails...suppose because of async resources are not being loaded in a deterministic order/time...assume that when the video container is ready, the page is fully loaded...
timeout = 3
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'html5-video-container'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

video_id = driver.page_source.split('"videoDetails":{"videoId":"')[1]
video_id = video_id.split('"')[0]
print(video_id)

driver.quit()

output:

NiXD4xVJM5Y

=> maybe there's a way to have chrome run in headless mode, will leave that to you :-)
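For the headless part, something along these lines should work (an untested sketch; `--headless=new` is the newer Chrome headless mode, older Chrome builds use plain `--headless`):

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
options.add_argument("--headless=new")           # plain "--headless" on older Chrome
options.add_argument("--window-size=1920,1080")  # some pages lay out differently headless
```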

mrxra
  • Thank you. This works but it isn't 100% consistent, any advice? It looks like adding the argument "--headless" makes it more consistent but this could just be a fluke. – futium May 11 '23 at 01:37
  • 1
    afraid not...I have absolutely no experience with web scraping. not sure what you mean by "isn't 100% consistent" but if you mean by it that it sometimes doesn't work, then my advice would be to change the HTML node in "presence_of_element_located". chrome loads like a zillion of resources (such as e.g. the js script you are parsing) asynchronously in non-deterministic order....so i would try to identify the – mrxra May 11 '23 at 06:59