java.net.URL class throwing MalformedURLException because of unknown protocol: blob

Question

I'm automating a test scenario that validates a PDF document. The document opens in a new browser tab when its link (an anchor tag) is clicked. I want to validate a few important pieces of content in the document, for which I'm using Apache PDFBox. However, the document URL has the prefix 'blob', which causes java.net.URL to throw a MalformedURLException for unknown protocol: blob. How should I define or add that protocol in Java?

Please let me know how to get rid of this error so that I can successfully use PDFBox to parse my pdf file.

Java version - 1.8

This is the screenshot of pdf document after it opens in a browser.

This is the HTML source of the document. Because it is a PDF view, I cannot perform operations such as fetching text or the window title.

The following is a sample code snippet:

public void readPdfContents() throws IOException {

    String url = "blob:https://cpswebqa.testcbidata.com/f9ad63bc-700e-4f49-a4fb-807ad1a44b01";
    URL pdfUrl = new URL(url);
    InputStream ips = pdfUrl.openStream();
    BufferedInputStream bis = new BufferedInputStream(ips);
    PDFParser pdfParser = new PDFParser(bis);
    pdfParser.parse();
    String pdfData = new PDFTextStripper().getText(pdfParser.getPDDocument());

    System.out.println("PDF Data is - " + pdfData);

}

Error stack trace -

Exception in thread "main" java.net.MalformedURLException: unknown protocol: blob
    at java.net.URL.<init>(URL.java:600)
    at java.net.URL.<init>(URL.java:490)
    at java.net.URL.<init>(URL.java:439)
    at com.cbsh.automation.file.testrunner.WEB.Sample.main(Sample.java:11)

Update the question with the relevant HTML, code trials and complete error stack trace. — undetected Selenium, Dec 12 '19 at 23:03
@DebanjanB Added sample code, screenshots and error stack. please let me know if it's useful — Shantanu, Dec 12 '19 at 23:30
@VGR - I tried it before. But, if I remove the prefix, it won't consider it a valid url for required document. i.e. I'm not able to access document without prefix and it's throwing IOException saying End Of file. — Shantanu, Dec 13 '19 at 17:38
Then it means the file is empty there, or maybe this URL is just a key into some database? Can you access these URLs from the browser? — Tilman Hausherr, Dec 16 '19 at 09:51
@TilmanHausherr Yes. I can access URL in the browser as it is. but, If I remove prefix, then pdf isn't displaying in browser - making it invalid url. — Shantanu, Dec 16 '19 at 14:30
So I searched for this "blob URL" thing myself and found this: https://superuser.com/a/1109873/389820 and https://stackoverflow.com/a/30881444/535646 This is new to me too, my understanding after reading these answers is that this URL is generated by the browser (probably javascript?) and only exists there. — Tilman Hausherr, Dec 16 '19 at 14:42
@TilmanHausherr Thank you for your research. And yes. I agree that these URLs are generated by browser internally (mostly JS) and so those are specific to that browser instance. But don't know if we can automate those/extract text from those. — Shantanu, Dec 17 '19 at 18:24

score 1 · Answer 1 · answered Feb 27 '20 at 22:11

I had the same problem and solved it by injecting JavaScript, as described here:

How to download an image with Python 3/Selenium if the URL begins with “blob:”?

I ported it to Java and it worked well. Here is the code:

 private String getBytesBase64FromBlobURI(ChromeDriver driver, String uri) {
    String script = " "
            + "var uri = arguments[0];"
            + "var callback = arguments[1];"
            + "var toBase64 = function(buffer){for(var r,n=new Uint8Array(buffer),t=n.length,a=new Uint8Array(4*Math.ceil(t/3)),i=new Uint8Array(64),o=0,c=0;64>c;++c)i[c]='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'.charCodeAt(c);for(c=0;t-t%3>c;c+=3,o+=4)r=n[c]<<16|n[c+1]<<8|n[c+2],a[o]=i[r>>18],a[o+1]=i[r>>12&63],a[o+2]=i[r>>6&63],a[o+3]=i[63&r];return t%3===1?(r=n[t-1],a[o]=i[r>>2],a[o+1]=i[r<<4&63],a[o+2]=61,a[o+3]=61):t%3===2&&(r=(n[t-2]<<8)+n[t-1],a[o]=i[r>>10],a[o+1]=i[r>>4&63],a[o+2]=i[r<<2&63],a[o+3]=61),new TextDecoder('ascii').decode(a)};"
            + "var xhr = new XMLHttpRequest();"
            + "xhr.responseType = 'arraybuffer';"
            + "xhr.onload = function(){ callback(toBase64(xhr.response)) };"
            + "xhr.onerror = function(){ callback(xhr.status) };"
            // Use the uri variable bound from arguments[0] instead of concatenating it into the script text.
            + "xhr.open('GET', uri);"
            + "xhr.send();";
    String result = (String) driver.executeAsyncScript(script, uri);
    return result;
}
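
As a follow-up, here is a minimal sketch of what could be done with the returned string before handing it to a PDF parser: decode the Base64 text into bytes and check for the "%PDF-" signature. The class name BlobPdfDecoder and its methods decode and looksLikePdf are hypothetical names chosen for illustration, and the sample string in main only stands in for the value the method above would return.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BlobPdfDecoder {

    /** Decodes the Base64 text returned by the injected script into raw bytes. */
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    /** Returns true if the bytes start with the "%PDF-" file signature. */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by getBytesBase64FromBlobURI(driver, uri).
        String base64 = Base64.getEncoder()
                .encodeToString("%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII));

        byte[] pdfBytes = decode(base64);
        System.out.println("Looks like a PDF: " + looksLikePdf(pdfBytes));
    }
}
```

If the check passes, the byte array can probably be passed straight to PDFBox, for example through PDDocument.load(byte[]) in PDFBox 2.x or Loader.loadPDF(byte[]) in 3.x, and then to PDFTextStripper as in the question. I have not verified this against the asker's setup.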

I hope it helps someone.

Cheers!

score 0 · Answer 2 · answered Mar 10 '23 at 11:49

If the image URL begins with "data:", the image data is embedded in the HTML page itself rather than stored on a remote server, so there is nothing to download over HTTP. Instead, the base64 payload in the URL can be decoded directly.

Image Source URL : data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAWAUNoinFRBWASIUA........AAAAElFTkSuQmCC

To save the image, the following code can be used:

// Get the image source data
String imageData = webElement.getAttribute("src");

// Split the data URL into its header ("data:image/png;base64") and its payload
String[] parts = imageData.split(",", 2);
// The MIME type sits between "data:" and the first ";"
String mimeType = parts[0].split(":")[1].split(";")[0];
String base64Data = parts[1];
String fileExtension;

if (mimeType.equals("image/jpeg")) {
    fileExtension = ".jpg";
} else if (mimeType.equals("image/png")) {
    fileExtension = ".png";
} else if (mimeType.equals("image/gif")) {
    fileExtension = ".gif";
} else {
    // Unsupported image format
    throw new IOException("Unsupported image format");
}

// Decode the base64 payload into raw image bytes
byte[] imageBytes = Base64.getDecoder().decode(base64Data);

// Write the bytes to the output file; try-with-resources closes the stream
String outputPath = "C:/images/image" + fileExtension;
try (FileOutputStream outputStream = new FileOutputStream(outputPath)) {
    outputStream.write(imageBytes);
}

This code first extracts the image data from the "data" URL and splits it into its MIME type and base64-encoded data components. It then determines the file extension based on the MIME type and saves the image to a file on disk, after decoding the base64-encoded image data. Note that you will need to handle any exceptions that may occur during the decoding and file I/O processes.

To use this code, you will need to import the following classes:

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;

The java.util.Base64 class is used to decode the base64-encoded image data. The java.io.FileOutputStream class is used to write the output file to disk.
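
The parsing steps described above can also be wrapped in a small self-contained helper, which makes them easier to test without a browser. This is only a sketch: the class name DataUrl and the methods parse and extensionFor are hypothetical, and it assumes the URL uses the ";base64" form shown earlier.

```java
import java.util.Base64;

class DataUrl {
    final String mimeType;
    final byte[] data;

    private DataUrl(String mimeType, byte[] data) {
        this.mimeType = mimeType;
        this.data = data;
    }

    /** Parses a base64 data URL such as "data:image/png;base64,iVBOR...". */
    static DataUrl parse(String url) {
        if (!url.startsWith("data:")) {
            throw new IllegalArgumentException("Not a data URL");
        }
        int comma = url.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Missing ',' separator");
        }
        // Header without the "data:" prefix, e.g. "image/png;base64"
        String header = url.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("Only base64 data URLs are handled in this sketch");
        }
        String mimeType = header.split(";", 2)[0];
        byte[] data = Base64.getDecoder().decode(url.substring(comma + 1));
        return new DataUrl(mimeType, data);
    }

    /** Maps the MIME types handled above to a file extension. */
    static String extensionFor(String mimeType) {
        switch (mimeType) {
            case "image/jpeg": return ".jpg";
            case "image/png":  return ".png";
            case "image/gif":  return ".gif";
            default:
                throw new IllegalArgumentException("Unsupported image format: " + mimeType);
        }
    }

    public static void main(String[] args) {
        // Build a tiny sample data URL; a real one would come from the element's src attribute.
        String payload = Base64.getEncoder().encodeToString(new byte[] {'G', 'I', 'F', '8', '9', 'a'});
        DataUrl parsed = parse("data:image/gif;base64," + payload);
        System.out.println(parsed.mimeType + " -> image" + extensionFor(parsed.mimeType)
                + " (" + parsed.data.length + " bytes)");
    }
}
```

The decoded bytes in the data field can then be written to disk with a FileOutputStream as shown earlier.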
