I found a Python script (from 2018) on GitHub for extracting YouTube transcripts.

I fixed line 37, which used the deprecated firefox_options argument, from:

driver = webdriver.Firefox(firefox_options=options)

to

driver = webdriver.Firefox(options=options)

I have a file named url.csv. It has a header row, 'url', and a single URL on line 2 for testing.

Lines 2 & 3 of captions.py have been modified from:

filename = 'videolist_zembla_273_2018_05_25-09_17_02.tab'
colname = 'videoId' 

To:

filename = 'url.csv'
colname = 'url'
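
For reference, a file laid out this way (one header row, then one URL per line) can be read with the standard csv module. This is only a minimal sketch of that layout, not the way captions.py itself loads the file, which is not shown here; the helper name read_column is made up for illustration.

```python
import csv


def read_column(path, colname):
    """Return the non-empty values of one named column from a CSV file.

    Hypothetical helper for illustration; not part of captions.py.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a clear message if the header does not match.
        if colname not in (reader.fieldnames or []):
            raise KeyError(f"column {colname!r} not found in {path}")
        return [
            row[colname].strip()
            for row in reader
            if (row[colname] or "").strip()
        ]


# Example call (assumes url.csv sits in the current working directory):
# urls = read_column("url.csv", "url")
```

Raising a KeyError when the header is missing makes a mismatch between filename and colname visible before the browser is even started.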

All files are in a folder named 'subtitles' with geckodriver.exe

The script runs until it hits the third exception, 'could not find transcript in options menu', and then fails.

I have tried different URLs with no success and suspect it may be a timeout issue, though I really have no idea what I'm doing or how to fix it.

Can anyone help me troubleshoot this further? I'm stumped at this point.

Any help appreciated.

pglove
  • Any reason for not using [YouTube Data API v3](https://developers.google.com/youtube/v3) [Captions: list](https://developers.google.com/youtube/v3/docs/captions/list) and [Captions: download](https://developers.google.com/youtube/v3/docs/captions/download) endpoints? Otherwise if you don't want to use YouTube Data API v3, there is [this reverse-engineered YouTube UI solution](https://stackoverflow.com/a/70013529/7123660). – Benjamin Loison Oct 01 '22 at 15:04
  • It took me all day just to get Python running again; it's been a couple of years since I last tried to do anything with it. After numerous searches I found the GitHub script and latched onto it, and it feels like I'm most of the way there. If I can't do it this way then maybe I'll look at the API approach, but at this point it feels like I've invested too much time getting as far as I have to give up on it now. Thanks for the links! – pglove Oct 01 '22 at 15:12
  • @Benjamin Loison, I went down a rabbit hole and ended up 'making' [this monstrosity](https://stackoverflow.com/questions/73938180/adding-a-csv-loop-to-python-script/73962642#73962642). Thanks for mentioning the API, it led me to some good resources. – pglove Oct 05 '22 at 15:30

1 Answer

I've managed to get it to work by changing the line

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#items > ytd-menu-service-item-renderer:nth-child(2) > yt-formatted-string"))) #items > ytd-menu-service-item-renderer:nth-child(2) > yt-formatted-string

to

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.TAG_NAME, "ytd-menu-service-item-renderer")))

The problem was that YouTube uses the ID 'items' on multiple elements, so the original CSS selector matched the wrong element. Additionally, I needed to change the line

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.CSS_SELECTOR, "ytd-transcript-body-renderer.style-scope")))

to

element = WebDriverWait(driver, waittime).until(EC.presence_of_element_located((By.CSS_SELECTOR, "ytd-transcript-segment-list-renderer")))
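
Because YouTube's markup changes over time, it can help to tell a stale selector apart from a wait that is simply too short. The following is a rough, pure-Python sketch of the polling idea behind an explicit wait, extended to try several candidate selectors; it is not Selenium's implementation. The function name wait_for_any and the find callable are hypothetical and introduced only for illustration.

```python
import time


def wait_for_any(find, selectors, timeout=10.0, poll=0.5):
    """Poll until one of the selectors matches; return (selector, matches).

    `find` is a hypothetical callable that takes a selector string and
    returns a list of matches (an empty list when nothing is found).
    """
    deadline = time.monotonic() + timeout
    while True:
        # Try every candidate selector on each polling round.
        for selector in selectors:
            found = find(selector)
            if found:
                return selector, found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"none of {selectors!r} matched within {timeout}s"
            )
        time.sleep(poll)
```

With Selenium, find could be something like lambda css: driver.find_elements(By.CSS_SELECTOR, css), and selectors could list both "ytd-transcript-segment-list-renderer" and the older "ytd-transcript-body-renderer.style-scope". The returned selector shows which layout the page actually uses, while a TimeoutError for every candidate, even with a generous timeout, suggests the selectors are out of date rather than the wait being too short.
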
Stefan
  • Thanks so much, I've been going insane playing with the selectors. Not to mention that once I got it working I decided to try and get it to toggle off the timestamp. I've begun working on a much simpler approach now, using what I've learned over the weekend. It's still a mess but I'm getting there. – pglove Oct 03 '22 at 16:02