I am trying to read the content of a blob held in browser memory, using Selenium in Python with script injection.
Here is the code:
from selenium import webdriver
import base64
from bs4 import BeautifulSoup


def download_blob(driver, uri):
    result = driver.execute_async_script("""
        var uri = arguments[0];
        var callback = arguments[arguments.length-1];
        var toBase64 = function(buffer){for(var r,n=new Uint8Array(buffer),t=n.length,a=new Uint8Array(4*Math.ceil(t/3)),i=new Uint8Array(64),o=0,c=0;64>c;++c)i[c]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".charCodeAt(c);for(c=0;t-t%3>c;c+=3,o+=4)r=n[c]<<16|n[c+1]<<8|n[c+2],a[o]=i[r>>18],a[o+1]=i[r>>12&63],a[o+2]=i[r>>6&63],a[o+3]=i[63&r];return t%3===1?(r=n[t-1],a[o]=i[r>>2],a[o+1]=i[r<<4&63],a[o+2]=61,a[o+3]=61):t%3===2&&(r=(n[t-2]<<8)+n[t-1],a[o]=i[r>>10],a[o+1]=i[r>>4&63],a[o+2]=i[r<<2&63],a[o+3]=61),new TextDecoder("ascii").decode(a)};
        var xhr = new XMLHttpRequest();
        xhr.responseType = 'arraybuffer';
        xhr.onload = function(){ callback(toBase64(xhr.response)) };
        xhr.onerror = function(){ callback(xhr.status) };
        xhr.open('GET', uri);
        xhr.send();
    """, uri)
    print(uri, result)
    if type(result) == int :
        raise Exception("Request failed with status %s" % result)
    return base64.b64decode(result)


options = webdriver.ChromeOptions()
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36')
driver = webdriver.Chrome(options=options)
url = 'https://www.youtube.com/watch?v=KBtk5FUeJbk'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html5lib')
blob_url = soup.find('video', attrs={'class': 'video-stream html5-main-video'})['src']
byte_stream = download_blob(driver, blob_url)
Output:
blob:https://www.youtube.com/5e3f1fab-3839-45a1-bb62-3582635b9e7d 0
Traceback (most recent call last):
  File "C:\Users\*****\Desktop\blob-download.py", line 32, in <module>
    byte_stream = download_blob(driver, blob_url)
  File "C:\Users\*****\Desktop\blob-download.py", line 20, in download_blob
    raise Exception("Request failed with status %s" % result)
Exception: Request failed with status 0
The result variable comes back as the integer 0, which means the request failed.
I don't understand what is going wrong. I expected at least part of the in-memory blob to be returned as bytes.
I based the code above on the answer to How to download an image with Python 3/Selenium if the URL begins with "blob:"?.
That answer says the blob URL has to be requested from the page that created the blob, so I scrape the blob URL with BeautifulSoup instead of hard-coding it.
Example:
byte_stream = download_blob(driver, 'blob:https://www.youtube.com/5e3f1fab-3839-45a1-bb62-3582635b9e7d') # this would definitely not work
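To rule out the "wrong page" case before injecting the script, here is a minimal sketch of a sanity check. It assumes a blob URL has the form blob:<origin>/<id>, which is what the URL in my output looks like. The helper names blob_origin and same_origin_as_page are hypothetical, not part of Selenium or of the referenced answer. A matching origin is only a necessary condition: it does not prove the blob still exists or can be read.

```python
from urllib.parse import urlsplit


def blob_origin(blob_url):
    """Return the scheme://host[:port] origin embedded in a blob: URL."""
    if not blob_url.startswith("blob:"):
        raise ValueError("not a blob URL: %r" % blob_url)
    # Strip the "blob:" prefix and parse what remains as an ordinary URL.
    inner = urlsplit(blob_url[len("blob:"):])
    return "%s://%s" % (inner.scheme, inner.netloc)


def same_origin_as_page(blob_url, page_url):
    """True if the origin inside the blob URL matches the origin of page_url."""
    page = urlsplit(page_url)
    return blob_origin(blob_url) == "%s://%s" % (page.scheme, page.netloc)
```

In my script this would be called as same_origin_as_page(blob_url, driver.current_url) right before download_blob(driver, blob_url).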
I also tried other websites, in case YouTube has a strict policy against scraping its content, but had no luck: every site I tried gave the same response.
Any insight into a JavaScript alternative is also welcome.