
I need to extract YouTube links, together with their titles, from YouTube playlists. I used SelectorGadget (a Chrome extension) to find the CSS selector, but when I try to select anything with it, BeautifulSoup returns nothing. I don't know where I'm going wrong.

Below is the code I wrote:

import sys
import requests
from bs4 import BeautifulSoup
import re

try:
    # checking url format
    url_pattern = re.compile(r"^(?:http|https|ftp):\/\/[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+\.[a-zA-Z0-9_~:\-\/?#[\]@!$&'()*+,;=`^.%]+$")

    # playlist_url = input("Enter your YouTube playlist URL: ")
    # read the playlist URL from the first command-line argument
    playlist_url = sys.argv[1]

    if not url_pattern.match(playlist_url):
        raise ValueError("Enter valid link")

    get_links_from_youtube_playlist(playlist_url)

except ValueError as value_error:
    print(value_error)

Then I pass the URL to another function:


def get_links_from_youtube_playlist(youtube_playlist_url):

    request_response = requests.get(youtube_playlist_url)

    # using "html.parser" lib
    # soup_object = BeautifulSoup(request_response.text, 'html.parser')
    # using "lxml" - Processing XML and HTML with Python
    soup_object = BeautifulSoup(request_response.text, 'lxml')

    # not working?!
    url_list = soup_object.select("#video-title")
    print(url_list)
    # this is not working too?!
    div_content = soup_object.find("div", attrs={"class" : "content"})
    print(div_content)

I run it with the following command:

python3 test.py https://www.youtube.com/playlist\?list\=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

The output is empty when I print the result of either method (`[]` from select, `None` from find). Shouldn't it find something meaningful, since the id is present in the page?

SelectorGadget shows me `#video-title` only when I click on that section, and I could not even access the div. How should I extract each link and its title?

Barmar
amkyp
  • There's no `id="video-title"` anywhere in that page. There are lots of `class="video-title"`. It sounds like this ID is something being added by the extension when you click on it, but how can BS know which item you want? – Barmar Sep 02 '19 at 19:10
  • Use `.video-title` to select by class. – Barmar Sep 02 '19 at 19:22
  • @Barmar thanks for helping, but I've tried `url_list = soup_object.select(".video-title")` and it again returns nothing (`[]`). – amkyp Sep 02 '19 at 20:30
  • The page returned by `requests.get()` is different from what the browser gets. YouTube is apparently checking the user agent. See https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python for how to customize the user agent. – Barmar Sep 02 '19 at 20:41

1 Answer


YouTube checks the user agent to determine what kind of page to return. If you send the user agent corresponding to a real browser, you'll get the response you expect. `video-title` is a class, not an ID, so change the selector to `.video-title`.

import pprint
from bs4 import BeautifulSoup
import requests

pp = pprint.PrettyPrinter()

def get_links_from_youtube_playlist(youtube_playlist_url):

    request_response = requests.get(youtube_playlist_url, headers={"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"})

    soup_object = BeautifulSoup(request_response.text, 'lxml')
    url_list = soup_object.select(".video-title")
    pp.pprint(url_list)

get_links_from_youtube_playlist('https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab')

Output:

[<div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>,
 <div class="video-title text-shell skeleton-bg-color"></div>]
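
The `div`s in this output are empty placeholders, so the titles and links themselves are not in the matched elements. For reference, here is a minimal sketch of how the ID and class selectors differ and how a link and its text can be read once the anchors are present. The markup in `SAMPLE_HTML`, the `extract_links` helper, and the `base_url` default are assumptions made for illustration; YouTube's real HTML is not shaped like this.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; YouTube's real HTML differs.
SAMPLE_HTML = """
<div class="content">
  <a class="video-title" href="/watch?v=abc123">First video</a>
  <a class="video-title" href="/watch?v=def456">Second video</a>
</div>
"""


def extract_links(html, base_url="https://www.youtube.com"):
    """Return (title, absolute_url) pairs for each <a class="video-title">."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for anchor in soup.select("a.video-title"):
        href = anchor.get("href")
        if href is None:
            # Skip anchors that carry no link target.
            continue
        title = anchor.get_text(strip=True)
        results.append((title, urljoin(base_url, href)))
    return results


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "#video-title" matches id="video-title"; nothing in the sample has that id.
id_matches = soup.select("#video-title")

# ".video-title" matches class="video-title"; both anchors have it.
class_matches = soup.select(".video-title")

links = extract_links(SAMPLE_HTML)
for title, url in links:
    print(title, url)
```

Note that `select()` returns an empty list when nothing matches, while `find()` returns `None`, which is why the two calls in the question print different empty results.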
Barmar