
I have a function that grabs the source code of YouTube clips and then tries to find: startTimeMs, endTimeMs, and videoId.

This is the videoId block:

import requests

class className():
    def __init__(self, link):
        # make the request
        self.r = requests.get(link)

    def originalVideoID(self):
        # get the source code
        source = str(self.r.content)

        # these are the endpoints in which the videoID is enclosed
        start = "\"videoDetails\":{\"videoId\":\""
        end = '\"'

        # gets everything right of videoDetails
        videoID = source.split(start)[1]

        # gets everything left of the quote
        videoID = videoID.split(end)[0]

        return videoID
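
A defensive variant of the same split logic (a sketch; `extract_video_id` is a hypothetical standalone helper, not part of the class above) checks for the marker before indexing, so a page served without the player JSON yields None instead of an IndexError:

```python
def extract_video_id(source):
    """Return the clip's original videoId, or None if the marker is absent."""
    start = '"videoDetails":{"videoId":"'
    end = '"'
    if start not in source:
        # the page was served without the expected player JSON
        return None
    return source.split(start)[1].split(end)[0]

# example with stub page sources
page = 'prefix "videoDetails":{"videoId":"NiXD4xVJM5Y","title":"x"} suffix'
print(extract_video_id(page))                         # NiXD4xVJM5Y
print(extract_video_id("<html>consent page</html>"))  # None
```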

Expected Outcome:

If given a YouTube Clip URL like: https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs,

videoID should consistently be NiXD4xVJM5Y.

Actual Outcome:

  • Sometimes, the expected outcome occurs.
  • Other times, the `videoID = source.split(start)[1]` line throws an IndexError.

When debugging this:

  • I added a check for `start in source` just before the split; it returns False whenever the IndexError is thrown.
  • I printed `str(self.r.content)` and can see that, in those cases, the source code is completely different.

What am I doing wrong? Is this a case for another tool like Selenium, am I using requests wrong, or should I approach this differently?

EDIT: This is the traceback of the error:

Traceback (most recent call last):
  File "PATHTOPROJECT\FILENAME.py", line 383, in <module>
    main()
  File "PATHTOPROJECT\FILENAME.py", line 165, in download_video
    downloadLink = className(link).originalVideoID()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "PATHTOPROJECT\FILENAME.py", line 67, in originalVideoID
    videoID = source.split(start)[1]
              ~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

The data that I am seeking in the source code is within this script:

<script nonce="601b9hyYx1NEaPf0pQewqA">
    var ytInitialPlayerResponse = 
    {
        ...
        "videoDetails":
        {
            "videoId":"NiXD4xVJM5Y", ...
        },
        ...
        "clipConfig":
        {
            "postId": ... ,"startTimeMs":"0","endTimeMs":"15000"
        }
    }
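
Since `ytInitialPlayerResponse` is a JSON object, an alternative to raw string splitting (as suggested in the comments) is to capture the whole object with a regex and parse it with `json.loads`. This is a sketch using only the stdlib; `sample_source` below is a stand-in for the real page source, and the non-greedy regex assumes the JSON contains no literal `};` inside a string:

```python
import json
import re

# stand-in for str(requests.get(link).content)
sample_source = (
    '<script nonce="x">var ytInitialPlayerResponse = '
    '{"videoDetails":{"videoId":"NiXD4xVJM5Y"},'
    '"clipConfig":{"postId":"p","startTimeMs":"0","endTimeMs":"15000"}};</script>'
)

# capture the JSON object assigned to ytInitialPlayerResponse
match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.*?\});", sample_source, re.DOTALL)
player = json.loads(match.group(1))

print(player["videoDetails"]["videoId"])    # NiXD4xVJM5Y
print(player["clipConfig"]["startTimeMs"])  # 0
```

Note this only changes how the result is interpreted; as the comments point out, it does not change how the page is fetched, so it will still fail when YouTube serves a consent page instead of the player JSON.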
futium
  • Can you give us the full traceback on one of these errors? – SimonUnderwood May 10 '23 at 22:20
  • It's JSON. Why don't you just use `json.loads` to convert it to a Python structure instead of hacking the string contents? – Tim Roberts May 10 '23 at 22:32
  • @TimRoberts would I still use requests.get().content to obtain the source code and then use json.loads on it or would json.loads avoid this? – futium May 10 '23 at 22:41
  • `json.loads` merely helps you interpret the results you get back. It doesn't change the way you fetch the results. – Tim Roberts May 10 '23 at 22:46
  • could you please tell us what `str(self.r.content)` contains when error happens? Also, could you check what `self.r.status_code` value is when error happens? – Muhammad Nizami May 10 '23 at 23:14
  • it also would be nice if you can tell us the list of links on which this error happens. – Muhammad Nizami May 10 '23 at 23:22
  • ...it appears to me you are trying to scrape the youtube website? with the provided URL your code _always_ produces an index out of range exception on my side...and the string you are using to split() doesn't appear at all in what I receive from youtube...I can only assume there is some vital javascript that doesn't get executed which loads the expected content.... – mrxra May 10 '23 at 23:27
  • @mrxra the data I am trying to get is within a script which has a bunch of data. If this is within JavaScript, is there a way to way for this to load first? Or a better way to access this? – futium May 10 '23 at 23:48
  • ...hmm...yes, javascript. youtube requires the client to accept the tracker cookies, then then reloads the page...which requests of course cannot handle. you could use selenium webdriver i guess. – mrxra May 11 '23 at 00:10

1 Answer


Download a chromedriver matching your version of Chrome (https://chromedriver.chromium.org/downloads), unzip the file, and change `path_to_chromedriver` in the following script, which accepts the cookie policy, waits for the page to be fully loaded and THEN parses the page content (your code/split logic):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path_to_chromedriver="chromedriver/chromedriver"
video_url = 'https://www.youtube.com/clip/UgkxU2HSeGL_NvmDJ-nQJrlLwllwMDBdGZFs'

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
# Selenium 4 removed executable_path; pass the driver path via a Service object
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(path_to_chromedriver), options=options)

driver.get(video_url)

# find "accept all" button and submit...
b = [b for b in driver.find_elements(by=By.TAG_NAME, value="button") if b.accessible_name and b.accessible_name.lower() == 'accept all'][0]
b.submit()

# https://stackoverflow.com/a/26567563/12693728: wait for page to be loaded. retrieving video id sometimes fails...suppose because of async resources are not being loaded in a deterministic order/time...assume that when the video container is ready, the page is fully loaded...
timeout = 3
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'html5-video-container'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

video_id = driver.page_source.split('"videoDetails":{"videoId":"')[1]
video_id = video_id.split('"')[0]
print(video_id)

driver.quit()

output:

NiXD4xVJM5Y

=> maybe there's a way to have chrome run in headless mode, will leave that to you :-)
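For the headless part, something along these lines should work (an untested sketch; `--headless=new` is the newer Chrome headless mode, older Chrome builds use plain `--headless`):

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
options.add_argument("--headless=new")           # plain "--headless" on older Chrome
options.add_argument("--window-size=1920,1080")  # some pages lay out differently headless
```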

mrxra
  • Thank you. This works but it isn't 100% consistent, any advice? It looks like adding the argument "--headless" makes it more consistent but this could just be a fluke. – futium May 11 '23 at 01:37
  • 1
    afraid not...I have absolutely no experience with web scraping. not sure what you mean by "isn't 100% consistent" but if you mean by it that it sometimes doesn't work, then my advice would be to change the HTML node in "presence_of_element_located". chrome loads like a zillion of resources (such as e.g. the js script you are parsing) asynchronously in non-deterministic order....so i would try to identify the – mrxra May 11 '23 at 06:59