
I'm trying to build a short Python program that extracts Pewdiepie's subscriber count, which is updated every second on socialblade, and shows it in the terminal. I want to fetch this data roughly every 30 seconds.

I've tried using PyQt, but it's slow. I turned to dryscrape, which is slightly faster but still doesn't work the way I want. I've just found Invader and written some short code that has the same problem: the number returned is the one from before the JavaScript on the page is executed:

from invader import Invader

url = 'https://socialblade.com/youtube/user/pewdiepie/realtime'
invader = Invader(url, js=True)

subscribers = invader.take(['#rawCount', 'text'])
print(subscribers.text)

I know that this data is accessible via the site's API, but it's not always working; sometimes it just redirects to this.

Is there a way to get this number after the JavaScript on the page has modified the counter, and not before? And which method seems best to you? Extract it:

  • from the original page, which always returns the same number for hours?
  • from the API's page, which bugs out when not using cookies in the code and after a certain amount of time?

Thanks for your advice!

Xewi

1 Answer


If you want to scrape a web page that has parts of it loaded in by JavaScript, you pretty much need to use a real browser.

In Python this can be achieved with pyppeteer:

import asyncio
from pyppeteer import launch

async def main():
    # headless=False opens a visible browser window; set headless=True to hide it
    browser = await launch(headless=False)
    page = await browser.newPage()
    # Wait until the network is idle so the page's JavaScript has finished updating the counter
    await page.goto('https://socialblade.com/youtube/user/pewdiepie/realtime', {
        'waitUntil': 'networkidle0'
    })
    # Read the counter from the rendered DOM
    count = int(await page.Jeval('#rawCount', 'e => e.innerText'))
    print(count)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
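
Since the question asks for a fresh reading roughly every 30 seconds, here is a minimal polling sketch built on the same approach (it assumes the same URL and #rawCount selector as above still work):

import asyncio
from pyppeteer import launch

async def poll_count(interval=30):
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto('https://socialblade.com/youtube/user/pewdiepie/realtime', {
        'waitUntil': 'networkidle0'
    })
    try:
        while True:
            # Reload so the counter reflects the latest value, then read it
            await page.reload({'waitUntil': 'networkidle0'})
            count = int(await page.Jeval('#rawCount', 'e => e.innerText'))
            print(count)
            await asyncio.sleep(interval)
    finally:
        await browser.close()

asyncio.get_event_loop().run_until_complete(poll_count())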

Note: It does not seem like the website you mentioned above is updating the subscriber count frequently any more (even with JavaScript). See: https://socialblade.com/blog/abbreviated-subscriber-counts-on-youtube/

For the best success and reliability you will probably need to set the user agent (page.setUserAgent in pyppeteer), keep it up to date, and use proxies (so your IP does not get banned). This can be a lot of work.
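
For example, in pyppeteer that might look like this (the proxy address is a placeholder and the user agent string is just an example):

import asyncio
from pyppeteer import launch

async def launch_disguised():
    # Route Chromium through a proxy so requests do not all come from one IP
    browser = await launch(args=['--proxy-server=http://my-proxy-host:8080'])
    page = await browser.newPage()
    # Pretend to be a regular desktop Chrome; keep this string up to date
    await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
    )
    return browser, page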

It might be easier and cheaper (in time, and compared to buying a large pool of proxies) to use a service that handles this for you, like Scraper's Proxy. It uses a real browser, returns the resulting HTML after the JavaScript has run, and routes all of your requests through a large network of proxies, so you can send a lot of requests without getting your IP banned.

Here is an example using the Scraper's Proxy API to get the count directly from YouTube:

import requests
from pyquery import PyQuery

# Send request to API
url = "https://scrapers-proxy2.p.rapidapi.com/javascript"

params = {
    "click_selector": '#subscriber-count', # (Wait for selector work-around)
    "wait_ajax": 'true',
    "url":"https://www.youtube.com/user/PewDiePie"
}

headers = {
    'x-rapidapi-host': "scrapers-proxy2.p.rapidapi.com",
    'x-rapidapi-key': "<INSERT YOUR KEY HERE>" # TODO
}

response = requests.request("GET", url, headers=headers, params=params)

# Query html
pq = PyQuery(response.text)
count_text = pq('#subscriber-count').text()

# Extract count from text
clean_count_text = count_text.split(' ')[0]
clean_count_text = clean_count_text.replace('K','000')
clean_count_text = clean_count_text.replace('M','000000')
count = int(clean_count_text)

print(count)
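
One caveat: the replace-based conversion above only works when YouTube reports a whole-number abbreviation like "111M subscribers". If the text ever contains a decimal (e.g. "1.53M"; the exact format is not guaranteed), a small helper like this hypothetical one is more robust:

def parse_abbreviated(text):
    # "111M subscribers" -> 111000000, "1.53M subscribers" -> 1530000
    value = text.split(' ')[0]
    multiplier = 1
    if value.endswith('K'):
        multiplier, value = 1_000, value[:-1]
    elif value.endswith('M'):
        multiplier, value = 1_000_000, value[:-1]
    return int(float(value) * multiplier)

print(parse_abbreviated(count_text))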

I know this is a bit late, but I hope this helps.

sazzy4o