1

I'm trying to download PDF files automatically through VBA, I already managed to automatically navigate to many URLs and extract text from the html via the CSS Selector method querySelectorAll() and save it into my Excel spreadsheet through VBA.

I also can click javascript buttons and I know how to download PDF files in general but it doesn't seem to work with the PDFs from the website I am working on. It looks like the PDF files don't exist on the servers but only exist as a BLOB (e.g. blob:null/7cea2352-704e-42e2-9da7-2b65082134bb ) and get converted to PDFs on the fly through some javascript code when I click manually on the "download PDF" button in the firefox built-in PDF preview window.

Is there a way on how to access those BLOB files through vba and convert them to files to download them automatically like normal PDF files? I was looking through several tutorials/answered questions but they never showed how to do this with vba but always through javascript only.

My Code so far (the relevant part):

Sub vbaCrawler()

Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True

IE.Navigate "websiteURL.com"

While IE.Busy Or IE.readyState < 4: DoEvents: Wend

t = Timer
counterX = 1
counterY = 1

Do
    DoEvents
    On Error Resume Next
    Set aNodeList = IE.document.querySelectorAll("#productPartSearchResult td")
    On Error GoTo 0
    If Timer - t = 10 Then Exit Do

    Loop While aNodeList Is Nothing
        If Not aNodeList Is Nothing Then
            For j = 18 To aNodeList.Length - 1
                Worksheets("CurrentStep").Cells(counterY, counterX).Value = aNodeList.Item(j).innerText

                If counterX < 9 Then
                counterX = counterX + 1
                Else
                counterX = 1
                counterY = counterY + 1
                End If
            Next j
        End If
//[...] bunch of code to format the text data

IE.Quit
Set IE = Nothing

End Sub

I could point excel to the canvas holding the BLOB file but I don't know how to go from there to make excel understand this canvas is actually holding a file that should be downloaded:

(Screenshot) the big white picture is the BLOB/PDF I am trying to download, the corresponding canvas is marked blue on the right side

here is a screenshot showing what happens if I rightclick the image that I am trying to download to view it in the browser:

screenshot image blob url

I was hoping for a filepath url so I can download that image, but it is just showing the blob url without any .png or .pdf extension which makes it hard for me to work with it. How can I download this file if it doesn't show me the filepath but only this blob url?

And how can I get to this blob url through VBA? right now I only know how to get to it manually by rightclicking the image with my mouse, but I don't find the blob url inside the html source code.

MaxMüller
  • 43
  • 5
  • here is also the full sourcecode of the website 1x as a plain txt file and 1x saved as .htm -> https://www.dropbox.com/sh/ori456xg5paq6a2/AABi5ygeCMqkR7dZNmDUUzZFa?dl=0 – MaxMüller Sep 15 '18 at 12:16
  • Your selector doesn't appear to select anything for the HTML supplied. The id #productPartSearchResult doesn't appear to exist. Which elements are you after please? (HTML) The selector for the _blue_ element is .querySelector("[id='explodedViewDocument:currentDocument']") Also, list of pdf links with .querySelectorAll("[href*=pdf]") – QHarr Sep 15 '18 at 14:15
  • yes I am after the blue element which would be selected with ".querySelector("[id='explodedViewDocument:currentDocument']")", my problem is what do I have to do next to download this element? behind this blue slected canvas element is an image, this is the image I want to download, problem is there is no .png or .pdf file to represent this image only a BLOB url appears when I rightclick the image to view the image: blob:null/7cea2352-704e-42e2-9da7-2b65082134bb – MaxMüller Sep 15 '18 at 15:53
  • yes the url is behind a loginmask... I am sorry I should have differentiate it better: the selector id #productPartSearchResult in the current code already is working perfectly fine for another url / subpage of the same Website. Now I want to download the big white image on the right side of the screenshot which is represented by the blue canvas element. the provided html is the content of the url / subpage that I am still working on. I didn't provide the html for the already working querySelector id #productPartSearchResult since that would be another url and its perfectly working – MaxMüller Sep 15 '18 at 15:59
  • the binary download sounds exactly as what I am trying to accomplish but don't know how to start. I cannot find any url like "website.com/folder/picture23.png" to download it the normal way, the only url I can find is something like "blob:null/7cea2352-704e-42e2-9da7-2b65082134bb"... do you know how to convert that url to a normal filepath url? or how to download a file from a strange url like that: "blob:null/7cea2352-704e-42e2-9da7-2b65082134bb"? – MaxMüller Sep 15 '18 at 16:18
  • the docs on the left are a small thumbnail in a very bad, unreadable quality. Those thumbnails are actually normal .png files which would be easy for me to download. But when I click the small thumbnails they show the big picture on the right which suddenly isn't a .png file but a blob file: "blob:null/7cea2352-704e-42e2-9da7-2b65082134bb" – MaxMüller Sep 15 '18 at 16:20
  • No I don't. If you inspect with fiddler or dev tools when manually downloading can you see anything in the web traffic that might help? But there are pdf links in those elements on the left. See my querySelectorAll above. No idea if they are useful. [Binary download](https://stackoverflow.com/a/52245239/6241235) – QHarr Sep 15 '18 at 16:21
  • I just checked all the pdf urls from your selector, sadly they all are irrelevant pdf-urls hidden in the upper middle "DOKUMENTE" Tab (under the white button "WARENKORB" from the screenshot). I also tried to download the png using your binary download method but he doesn't accept the url "blob:null/7cea2352-704e-42e2-9da7-2b65082134bb" responding with a error "the url doesn't use any known protocol" – MaxMüller Sep 15 '18 at 17:21
  • would it be possible to embed [this blobToFile javascript Method](https://stackoverflow.com/a/29390393/10049408) into vba and call it? It seems to do exactly what I want which is converting this strange blob url "blob:null/7cea2352-704e-42e2-9da7-2b65082134bb" into a file like ".../mypicture.png" so I can download it using your binary download method. Only problem I have is: this is javascript, is there a way to embed it in VBA and use it? – MaxMüller Sep 15 '18 at 17:21
  • 32 bit there are ways to execute javascript with script control. There are also 23/64 bit parentWindow.execScript methods. – QHarr Sep 15 '18 at 17:23
  • nothing showing or popping up in the network tab under firefox devtools when downloading manually. – MaxMüller Sep 15 '18 at 17:27

0 Answers0