How to web scrape youtube transcripts with Beautifulsoup4 and Python 3

Question

Here is my current code. I am not sure what I am doing wrong. Maybe I am not digging deep enough in the html and giving Beautifulsoup the right tags? At the moment, my code is returning me blanks.

from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://www.youtube.com/watch?v=5_zrHZdhaBU")
soup = BeautifulSoup(html,'html.parser')
nameList = soup.findAll("div", {"id": "cp-2"})
for name in nameList:
    print(name.get_text())

Here is the code that I inspected. I'm trying to get Python to return back to me "but it was untucked"

<div id="cp-2" class="caption-line" data-time="7.54"><div class="caption-line-time">0:07</div><div class="caption-line-text">but it was untucked.</div></div>

***Edit

The code can be found by clicking on "more" next to the share button. Then you click on transcripts and you will see all the text there.

I can't find this line on the page and in the html. What is this line? — Yevhen Kuzmovych, Dec 01 '16 at 23:28
Are you sure this is not loaded dynamically via ajax? Open page source, there may not be such an element in static source. — Andrey Moiseev, Dec 02 '16 at 00:49
@Yevhen Kuzmovych If you go to the youtube page, there is a "more" button next to share. Click on it, then click on transcripts. It is line 0:07. — B Hok, Dec 02 '16 at 03:14
@Andrey Moiseev Maybe it is? I just noticed I do not see in open page source too. I just used google chrome's inspect to find the snippet. I'm looking at the transcript which can be found be clicking on "more" next to the share button. — B Hok, Dec 02 '16 at 03:23
@BHok You can probably find the file that the transcripts are loaded from. "Resources" or "Network" element inspector tabs. — Andrey Moiseev, Dec 02 '16 at 11:15

score 0 · Accepted Answer · edited May 23 '17 at 11:53

0

Oh yes, it's loaded via Ajax: open the page, then open Network tab, sort requests by start time (latest requests first), click CC button on Youtube.

You get api/timedtext request, the response is an XML. Here it the full url to the transcript:

https://www.youtube.com/api/timedtext?signature=1A03D323CBD455E9993B7AC447CA64764FA6FE75.59F4BD2D45A32E89FBF54B418EE2F763283A1007&asr_langs=fr%2Cja%2Cnl%2Ces%2Cru%2Cko%2Cit%2Cde%2Cpt%2Cen&key=yttt1&caps=asr&v=5_zrHZdhaBU&hl=en_US&expire=1480702409&sparams=asr_langs%2Ccaps%2Cv%2Cexpire&lang=en&fmt=srv3

I have no idea how this URL is generated, though. This requires invesigation of complex YouTube scripts, etc.

EDIT: This answer helped me. You can omit most of these parameters and just use this URL:

https://www.youtube.com/api/timedtext?&v=5_zrHZdhaBU&lang=en

Or this in general:

https://www.youtube.com/api/timedtext?&v={video_id}&lang={language_code}

edited May 23 '17 at 11:53

Community

1
1

answered Dec 02 '16 at 11:25

Andrey Moiseev

3,914
7
48
64

Does this mean the transcript can only be scraped by going to another url? And cannot be scraped directly from the page? – B Hok Dec 02 '16 at 15:08
@BHok Yes, you need a different url. You need to extract the `{video_id}` part of your old url, for example, with [this regex](https://regex101.com/r/RuGXmI/2): `v=(?P[a-zA-Z\d_]+)`. Or parse the url with some library and get the `v` parameter, it's a tedious task. And then put into the new one, if you need this to be done automatically. – Andrey Moiseev Dec 02 '16 at 15:21
@BHok If this answer solves your problem, consider [marking it as accepted](http://stackoverflow.com/help/accepted-answer) (green check). – Andrey Moiseev Dec 02 '16 at 15:51

How to web scrape youtube transcripts with Beautifulsoup4 and Python 3

1 Answers1