Using Python & Selenium to Extract YouTube Captions

Question

I found a Python script (from 2018) on GitHub for extracting YouTube transcripts.

I fixed the deprecated argument on line 37, changing it from:

driver = webdriver.Firefox(firefox_options=options)

to

driver = webdriver.Firefox(options=options)

I have a file named url.csv.

It has a header row, 'url'.

There is one URL on line 2 of the CSV for testing.

Lines 2 & 3 of captions.py have been modified from:

filename = 'videolist_zembla_273_2018_05_25-09_17_02.tab'
colname = 'videoId'

To:

filename = 'url.csv'
colname = 'url'
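For anyone checking that url.csv is shaped the way those two settings expect, here is a minimal sketch. It assumes the file is a plain comma-separated file with a header row; `read_column` is a hypothetical helper for illustration, not a function from the GitHub script, which may read its input differently.

```python
import csv


def read_column(filename, colname):
    """Return the non-empty values found under `colname` in a CSV file.

    Hypothetical helper: assumes a comma-delimited file with a header row.
    """
    with open(filename, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a clear message if the header does not match.
        if reader.fieldnames is None or colname not in reader.fieldnames:
            raise KeyError(f"column {colname!r} not found in {filename}")
        values = []
        for row in reader:
            value = (row[colname] or "").strip()
            if value:  # skip blank cells
                values.append(value)
        return values
```

Calling `read_column('url.csv', 'url')` should return a list with one entry per URL row. A `KeyError` here would point to a header mismatch rather than a Selenium problem.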

All files are in a folder named 'subtitles', along with geckodriver.exe.

The script runs until the third exception, 'could not find transcript in options menu', then fails.

I have tried different URLs with no success and suspect it may be a timeout issue, though I really have no clue what I'm doing or how to fix it.

Can anyone help me troubleshoot this further? I'm stumped at this point.

Any help appreciated.

Any reason for not using [YouTube Data API v3](https://developers.google.com/youtube/v3) [Captions: list](https://developers.google.com/youtube/v3/docs/captions/list) and [Captions: download](https://developers.google.com/youtube/v3/docs/captions/download) endpoints? Otherwise if you don't want to use YouTube Data API v3, there is [this reverse-engineered YouTube UI solution](https://stackoverflow.com/a/70013529/7123660). — Benjamin Loison, Oct 01 '22 at 15:04
It took me all day just to get python running again, it's been a couple of years since I last tried to do anything with it. After numerous searches I found the github script and latched on to it, and it feels like I'm most of the way there. If I can't do it this way then maybe I'll look at the API approach, but at this point it feels like I've invested too much time getting as far as I have to give up on it now. Thanks for the links! — pglove, Oct 01 '22 at 15:12
@Benjamin Loison, I went down a rabbit hole and ended up 'making' [this monstrosity](https://stackoverflow.com/questions/73938180/adding-a-csv-loop-to-python-script/73962642#73962642). Thanks for mentioning the API, it led me to some good resources. — pglove, Oct 05 '22 at 15:30

score 2 · Answer 1 · answered Oct 02 '22 at 17:08

I've managed to get it to work by changing the line

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#items > ytd-menu-service-item-renderer:nth-child(2) > yt-formatted-string"))) #items > ytd-menu-service-item-renderer:nth-child(2) > yt-formatted-string

to

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.TAG_NAME, "ytd-menu-service-item-renderer")))

The problem was that the YouTube page uses the ID 'items' on multiple elements, so the CSS selector matched the wrong one. Additionally, I needed to change the line

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.CSS_SELECTOR, "ytd-transcript-body-renderer.style-scope")))

to

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.CSS_SELECTOR, "ytd-transcript-segment-list-renderer")))

Thanks so much, I've been going insane playing with the selectors. Not to mention that once I got it working I decided to try and get it to toggle off the timestamp. I've begun working on a much simpler approach now, using what I've learned over the weekend. It's still a mess but I'm getting there. — pglove, Oct 03 '22 at 16:02
