12

When using web.whatsapp.de one can see that the link to a received image may look like this:

blob:https://web.whatsapp.com/3565e574-b363-4aca-85cd-2d84aa715c39

If the link is pasted into the address bar, it opens the image; however, if "blob:" is left out, it simply opens a new WhatsApp Web window.

I am trying to download the image displayed by this link.

But common techniques such as requests, urllib.request, or even BeautifulSoup always fail at the same point: the "blob" at the beginning of the URL throws an error.

The answers to "Download file from Blob URL with Python" throw either the error

URLError: <urlopen error unknown url type: blob>

or the error

InvalidSchema: No connection adapters were found for 'blob:https://web.whatsapp.com/f50eac63-6a7f-48a4-a2b8-8558a9ffe015'

(using BeautifulSoup)
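Both errors have the same cause: the URL's scheme is `blob`, not `https`, and neither library has a handler for it. A minimal sketch that makes this visible by checking the scheme before any download is attempted (the helper name `is_http_fetchable` is hypothetical, used only for illustration):

```python
from urllib.parse import urlsplit


def is_http_fetchable(url: str) -> bool:
    """Return True only for URLs an HTTP client can request directly."""
    # For "blob:https://host/uuid" the parsed scheme is "blob", not "https".
    return urlsplit(url).scheme in ("http", "https")


blob_url = "blob:https://web.whatsapp.com/3565e574-b363-4aca-85cd-2d84aa715c39"
print(urlsplit(blob_url).scheme)    # the scheme the HTTP libraries see
print(is_http_fetchable(blob_url))  # whether requests/urllib could handle it
```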

Using a naive approach like:

import requests

url = 'https://web.whatsapp.com/f50eac63-6a7f-48a4-a2b8-8558a9ffe015'
fileName = 'test.png'
req = requests.get(url)
# Write the response body to disk in chunks; the file is closed automatically
with open(fileName, 'wb') as file:
    for chunk in req.iter_content(100000):
        file.write(chunk)

will simply result in the same error as using BeautifulSoup.

I am controlling Chrome using Selenium in Python; however, I was unable to download the image correctly using the provided link.

jozxyqk
Kev1n91
  • Could you please include the relevant HTML source of the img you are trying to scrape? – Bin Ury Nov 22 '17 at 00:04
  • web.whatsapp.com , the url links from an image will differ from user to user, so I am not able to provide an exemplary link – Kev1n91 Nov 22 '17 at 00:08
  • When previewing shared images on that page, a download button appears in the corner. You could try triggering that button with a mouse click in Selenium which should prompt the browser to download the blob resource. Some configuration to permit automatic downloads may be required according to the link I shared below. – Bin Ury Nov 22 '17 at 00:15

3 Answers

16

A blob is a file-like object of raw data stored by the browser.

You can see them at chrome://blob-internals/

It's possible to get the content of a blob with Selenium through script injection. However, you'll have to comply with the same-origin policy by running the script on the page/domain that created the blob:

import base64

def get_file_content_chrome(driver, uri):
  result = driver.execute_async_script("""
    var uri = arguments[0];
    var callback = arguments[1];
    var toBase64 = function(buffer){for(var r,n=new Uint8Array(buffer),t=n.length,a=new Uint8Array(4*Math.ceil(t/3)),i=new Uint8Array(64),o=0,c=0;64>c;++c)i[c]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".charCodeAt(c);for(c=0;t-t%3>c;c+=3,o+=4)r=n[c]<<16|n[c+1]<<8|n[c+2],a[o]=i[r>>18],a[o+1]=i[r>>12&63],a[o+2]=i[r>>6&63],a[o+3]=i[63&r];return t%3===1?(r=n[t-1],a[o]=i[r>>2],a[o+1]=i[r<<4&63],a[o+2]=61,a[o+3]=61):t%3===2&&(r=(n[t-2]<<8)+n[t-1],a[o]=i[r>>10],a[o+1]=i[r>>4&63],a[o+2]=i[r<<2&63],a[o+3]=61),new TextDecoder("ascii").decode(a)};
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'arraybuffer';
    xhr.onload = function(){ callback(toBase64(xhr.response)) };
    xhr.onerror = function(){ callback(xhr.status) };
    xhr.open('GET', uri);
    xhr.send();
    """, uri)
  if isinstance(result, int):
    raise Exception("Request failed with status %s" % result)
  return base64.b64decode(result)

content = get_file_content_chrome(driver, "blob:https://developer.mozilla.org/7f9557f4-d8c8-4353-9752-5a49e85058f5")
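The function returns raw bytes, so saving the result is an ordinary binary write. `driver` is the Selenium WebDriver instance (for example one created with `webdriver.Chrome()`) that already has the page owning the blob open. A minimal sketch of the saving step; the helper names `guess_extension` and `save_blob` and the short signature table are assumptions for illustration, not part of the answer above:

```python
from pathlib import Path

# A few well-known file signatures (magic numbers); extend as needed.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"OggS", ".ogg"),
)


def guess_extension(data: bytes, default: str = ".bin") -> str:
    """Pick a file extension from the leading bytes, or fall back to default."""
    for magic, ext in _SIGNATURES:
        if data.startswith(magic):
            return ext
    return default


def save_blob(data: bytes, stem: str, directory: str = ".") -> Path:
    """Write the blob content to <directory>/<stem><ext> and return the path."""
    path = Path(directory) / (stem + guess_extension(data))
    path.write_bytes(data)
    return path
```

It could then be called as `save_blob(get_file_content_chrome(driver, blob_url), "whatsapp_image")`, where `blob_url` is the `src` read from the image element.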
Florent B.
  • How to save this content to file or play the audio? – Rodrigo Vieira Dec 17 '20 at 19:21
  • What is the argument driver referring to? Could you give us an example? – epsimatic88 May 17 '22 at 15:16
  • This isn't working for me. The result I keep getting back is 0. Is this solution still functioning? – Phil Oct 25 '22 at 14:38
  • CORRECTION: I'm able to retrieve a legitimate base64 string after visiting the blob url FIRST prior to calling this method. I was scraping the blob url then calling `evaluate_async_script` (Ruby) but this only works when visiting the blob url via browser first. – Phil Oct 25 '22 at 15:23
4

Blobs are not actual files to be remotely retrieved by a URI. Instead, they are programmatically generated pseudo-URLs which are mapped to binary data in order to give the browser something to reference. That is, there is no attribute of <img> that accepts raw data, so you instead create a blob address to map that data to the standard src attribute.

From the MDN page linked above:

The only way to read content from a Blob is to use a FileReader. The following code reads the content of a Blob as a typed array.

var reader = new FileReader();
reader.addEventListener("loadend", function() {
   // reader.result contains the contents of blob as a typed array
});
reader.readAsArrayBuffer(blob);
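On the Python side, if an injected script reads the blob with `FileReader.readAsDataURL` rather than `readAsArrayBuffer`, what comes back is a base64 `data:` URL string. A minimal sketch for turning such a string into bytes; the function name `decode_data_url` is hypothetical, and the sketch assumes a base64-encoded payload:

```python
import base64
from typing import Tuple


def decode_data_url(data_url: str) -> Tuple[str, bytes]:
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    if not data_url.startswith("data:"):
        raise ValueError("not a data URL")
    # Everything before the first comma is the header, the rest is the payload.
    header, sep, payload = data_url[len("data:"):].partition(",")
    if not sep or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    mime_type = header[: -len(";base64")] or "text/plain"
    return mime_type, base64.b64decode(payload)


# The payload below is the 8-byte PNG file signature, base64-encoded.
mime, raw = decode_data_url("data:image/png;base64,iVBORw0KGgo=")
print(mime, raw[:4])
```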
Bin Ury
  • Thank you for your insight. I am fairly new to JavaScript; could you please tell me what I have to pass in as "blob"? The "link" I have? – Kev1n91 Nov 21 '17 at 23:27
  • If I invoke the command readAsDataURL("blob:https://web.whatsapp.com/3565e574-b363-4aca-85cd-2d84aa715c39"), I get the error that the argument is not of type Blob. The examples are great, but it is still undefined when used. Have you tried it with an exemplary link from WhatsApp Web? – Kev1n91 Nov 21 '17 at 23:38
  • It's looking for an actual Blob object rather than a URL. From what I can tell downloading files with a headless browser requires some type of workaround to begin with (see: https://blog.codecentric.de/en/2010/07/file-downloads-with-selenium-mission-impossible) as Javascript (for security) does not provide a mechanism to automatically download a file. It is unclear to me whether or not it is possible to programmatically download a blob resource as I have not yet come across any examples of this headless browser usage. – Bin Ury Nov 21 '17 at 23:59
2

For people who are trying to do the same in Node.js and Selenium, see below.

var script = function (blobUrl) {
    console.log(arguments);
    var uri = arguments[0];
    var callback = arguments[arguments.length - 1];
    var toBase64 = function(buffer) {
        for(var r,n=new Uint8Array(buffer),t=n.length,a=new Uint8Array(4*Math.ceil(t/3)),i=new Uint8Array(64),o=0,c=0;64>c;++c)
            i[c]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".charCodeAt(c);for(c=0;t-t%3>c;c+=3,o+=4)r=n[c]<<16|n[c+1]<<8|n[c+2],a[o]=i[r>>18],a[o+1]=i[r>>12&63],a[o+2]=i[r>>6&63],a[o+3]=i[63&r];return t%3===1?(r=n[t-1],a[o]=i[r>>2],a[o+1]=i[r<<4&63],a[o+2]=61,a[o+3]=61):t%3===2&&(r=(n[t-2]<<8)+n[t-1],a[o]=i[r>>10],a[o+1]=i[r>>4&63],a[o+2]=i[r<<2&63],a[o+3]=61),new TextDecoder("ascii").decode(a)
    };
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'arraybuffer';
    xhr.onload = function(){ callback(toBase64(xhr.response)) };
    xhr.onerror = function(){ callback(xhr.status) };
    xhr.open('GET', uri);
    xhr.send();
}
driver.executeAsyncScript(script, imgEleSrc).then((result) => {
    console.log(result);
})

For a detailed explanation, see https://medium.com/@anoop.goudar/how-to-get-data-from-blob-url-to-node-js-server-using-selenium-88b1ad57e36d

AnoopGoudar