
I'm trying to scrape a value from a site using Python, for example the view count of this video (see the URL in the code). My code always prints "None". What am I doing wrong? Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')
views = soup.body.find(class_='view-count style-scope ytd-video-view-count-renderer')
print(views)

Thanks! (btw when I try the code shown in the video it works fine)

Sven
  • Does this answer your question? [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – ggorlen Jul 05 '21 at 22:31

2 Answers


The page is loaded dynamically, and requests doesn't execute JavaScript, so the elements you are looking for are not in the HTML it returns. However, the data is available in JSON format inside the page source, and you can use the re and json modules to extract it.

For example, to get the "view count":

import re
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/watch?v=1OfK8UmLMl0&ab_channel=HitraNtheUnnecessaryProgrammer"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# We locate the JSON data using a regular-expression pattern.
# re.search needs a string, so the soup is converted with str() first.
data = re.search(r"var ytInitialData = ({.*?});", str(soup)).group(1)
data = json.loads(data)

print(
    data["contents"]["twoColumnWatchNextResults"]["results"]["results"]["contents"][0][
        "videoPrimaryInfoRenderer"
    ]["viewCount"]["videoViewCountRenderer"]["viewCount"]["simpleText"]
)

Output:

124 views
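
One caveat with the non-greedy pattern above is that it stops at the first `};` it sees, which could fall inside a string value and cut the JSON short. A minimal sketch of an alternative, assuming the page still contains the `var ytInitialData = ` marker, is to let the standard library's `json.JSONDecoder.raw_decode` parse exactly one JSON value starting right after the marker. The helper name `extract_json_after` and the `sample_html` string are made up for illustration; the sample is not real YouTube output.

```python
import json


def extract_json_after(html, marker="var ytInitialData = "):
    """Return the JSON value that immediately follows `marker` in `html`."""
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found")
    # raw_decode parses a single JSON value beginning at the given index
    # and ignores whatever follows it (here the trailing ";</script>").
    obj, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return obj


# Hypothetical, heavily shortened stand-in for the real page source.
sample_html = (
    "<script>var ytInitialData = "
    '{"viewCount": {"simpleText": "124 views"}, "note": "a }; inside a string"};'
    "</script>"
)

sample_data = extract_json_after(sample_html)
print(sample_data["viewCount"]["simpleText"])
```

With the real page you would pass `requests.get(url).text` instead of `sample_html`; whether the marker is still present depends on YouTube's current markup.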

The variable data holds all of the data as a Python dictionary (dict). To print everything, you can use:

print(json.dumps(data, indent=4))

Output (truncated):

{
    "responseContext": {
        "serviceTrackingParams": [
            {
                "service": "CSI",
                "params": [
                    {
                        "key": "c",
                        "value": "WEB"
                    },
                    {
                        "key": "cver",
                        "value": "2.20210701.07.00"
                    },
                    {
                        "key": "yt_li",
                        "value": "0"
                    },
                    {
                        "key": "GetWatchNext_rid",
                        "value": "0x1d62a299beac9e1f"
                    }
                ]
            },
            {
                "service": "GFEEDBACK",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    },
                    {
                        "key": "e",
                        "value": "24037443,24058293,24058128,24003103,24042870,23882685,24023960,23944779,24027649,24046896,24059898,24049577,23983296,23966208,24056265,23891346,1714258,24049575,24045412,24003105,23999405,24051884,23891344,23986022,24049573,24056839,24053866,24058240,23744176,23998056,24010336,24037586,23934970,23974595,23735348,23857950,24036947,24051353,24038425,23990875,24052245,24063702,24058380,23983813,24058812,24026834,23996830,23946420,24001373,24049820,24030040,24062848,23968386,24027689,24004644,23804281,24049569,23973490,24044110,23884386,24012512,24044124,24059521,23918597,24007246,24049567,24022729,24037794"
                    }
                ]
            },
            {
                "service": "GUIDED_HELP",
                "params": [
                    {
                        "key": "logged_in",
                        "value": "0"
                    }
                ]
            },
            {
                "service": "ECATCHER",
                "params": [
                    {
                        "key": "client.version",
                        "value": "2.20210701"
                    },
                    {
                        "key": "client.name",
                        "value": "WEB"
                    }
                ]
            }
        ],
        "mainAppWebResponseContext": {
            "loggedOut": true
        },
        "webResponseContextExtensionData": {
            "ytConfigData": {
                "visitorData": "CgtoanprT1pPbmtWTSjYk46HBg%3D%3D",
                "rootVisualElementType": 3832
            },
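
Because the dictionary is this large and deeply nested, a hard-coded chain of keys is fragile. As a rough sketch, a small recursive generator can search for a key at any depth instead. The function name `find_key` and the `sample` dictionary are illustrative assumptions; the sample only mimics the nesting used in the key path above and is not real output.

```python
def find_key(obj, key):
    """Yield every value stored under `key`, at any depth of nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            # Keep descending, since the same key may also appear deeper down.
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


# Hand-written stand-in that imitates the shape of the real data.
sample = {
    "contents": {
        "twoColumnWatchNextResults": {
            "results": {
                "results": {
                    "contents": [
                        {
                            "videoPrimaryInfoRenderer": {
                                "viewCount": {
                                    "videoViewCountRenderer": {
                                        "viewCount": {"simpleText": "124 views"}
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}

renderers = list(find_key(sample, "videoViewCountRenderer"))
print(renderers[0]["viewCount"]["simpleText"])
```

On the real data the same key may occur more than once, so it is worth inspecting all matches rather than assuming the first one is the right one.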
MendelG
  • Just a heads-up that this code may stop working, as YouTube is constantly changing its site. Note that `re.search` needs a string: passing the `soup` object itself raises `TypeError: expected string or bytes-like object`, while passing `soup.prettify()` instead works (for now). – Matt Popovich Jan 04 '22 at 01:41

When a site loads its content dynamically, I usually start by inspecting the API requests in the Network tab of the browser's dev tools. That has worked for me on sites such as Udemy, Skillshare and a few others, but not on YouTube. In that case I would use the official YouTube API, which is quite easy to use and has plenty of code samples on GitHub. You request the data, get a JSON response, and convert it to a dictionary with response.json(). Another option is Selenium, but I don't like it as a solution because it is resource-hungry and slow. Requesting data from an API is generally much faster than scraping; scraping is what you fall back on when a site doesn't provide an API.
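
As a rough illustration of the official-API route, the sketch below queries the YouTube Data API v3 `videos` endpoint with `part=statistics` and reads `viewCount` from the response. It assumes you have created an API key in the Google Cloud console. To stay dependency-free it uses the standard library's `urllib` rather than requests, and the helper names and the `sample_payload` dictionary are made up for illustration (the payload is a trimmed, hand-written stand-in, not a real response).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_stats_url(video_id, api_key):
    """Build the request URL for a video's statistics."""
    query = urlencode({"part": "statistics", "id": video_id, "key": api_key})
    return f"{API_URL}?{query}"


def parse_view_count(payload):
    """Return the view count as an int, or None if it is not in the payload."""
    items = payload.get("items", [])
    if not items:
        return None
    # The API reports counts as strings, so convert explicitly.
    count = items[0].get("statistics", {}).get("viewCount")
    return int(count) if count is not None else None


def fetch_view_count(video_id, api_key):
    """Call the API over the network and return the view count."""
    with urlopen(build_stats_url(video_id, api_key)) as resp:
        return parse_view_count(json.load(resp))


# Trimmed, hand-written stand-in for an API response.
sample_payload = {"items": [{"id": "1OfK8UmLMl0", "statistics": {"viewCount": "124"}}]}
print(parse_view_count(sample_payload))
```

A real call would look like `fetch_view_count("1OfK8UmLMl0", "YOUR_API_KEY")`, where `YOUR_API_KEY` is a placeholder for your own key.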

Hyperx837