VBA Webscrape URL from HTML (src="")

Question

I tried to combine code parts that I could get working. It worked with <span> and <meta>, but it does not work with <img>. Can anyone help me make it work?

I am trying to get: https://www.lego.com/cdn/cs/set/assets/blt34360a0ffaff7811/11015_alt.png?fit=bounds&format=png&width=800&height=800&dpr=1 from this HTML:

<img src="https://www.lego.com/cdn/cs/set/assets/blt34360a0ffaff7811/11015_alt.png?fit=bounds&amp;format=png&amp;width=800&amp;height=800&amp;dpr=1" alt="" class="Imagestyles__Img-sc-1qqdbhr-0 cajeby">

This is the part of the code where I want to get the src URL:

Sub picgrab()

  Dim Doc As Object  
  Dim nodeAllPic As Object
  Dim nodeOnePic As Object
  Dim pic As Object

  Set Doc = CreateObject("htmlFile")

  With CreateObject("MSXML2.XMLHTTP.6.0")
  
    url = "https://www.lego.com/hu-hu/product/around-the-world-11015"
    .Open "GET", url, False
    .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
    .send
' It is important that I can't use InternetExplorer.

' This should work I guess, but execution skips the loop body after the 'For Each' line.
    Set nodeAllPic = Doc.getElementsByClassName("Imagestyles__Img-sc-1qqdbhr-0 cajeby")

    For Each nodeOnePic In nodeAllPic
        If nodeOnePic.getAttribute("class") = "Imagestyles__Img-sc-1qqdbhr-0 cajeby" Then
           Set pic = nodeOneVip.getElementsByClassName("Imagestyles__Img-sc-1qqdbhr-0 cajeby")(0)
           ActiveCell.Value = pic.getAttribute("src")
        
        End If
    Next nodeOnePic
  
  End With
  
End Sub

I tried the code above and modified it in many ways, but couldn't get the content of src="".

If you only need the first image, you can also read the URL from a meta tag. — Zwenn, Jan 18 '23 at 22:31
@Zwenn Thank you, that was a good suggestion. After the other example I could easily implement it into my program and it worked great :) . — hwka, Jan 19 '23 at 00:08
@Zwenn It would be interesting to know which method (Selenium or yours) would be faster for the program, as I have about 700 links to go through automatically, and for each Lego set I will collect data from other sites as well, not just lego.com. — hwka, Jan 19 '23 at 00:08
@Zwenn You are right, I need to practice a lot to get more experience in this area of web scraping. Other sites may present new challenges that get me stuck, and I'll try to find a solution myself, but thanks for making Stack Overflow so useful. I guess I will have other questions :) — hwka, Jan 19 '23 at 00:12
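
For reference, the meta-tag route mentioned in the comments might look roughly like the sketch below. The comment does not say which tag was meant. This sketch assumes an Open Graph og:image tag is present in the static HTML, which is not confirmed here, and the sub name PrintMetaImageUrl is made up for illustration.

```vba
Sub PrintMetaImageUrl()
    Dim http As Object
    Dim Doc As Object
    Dim meta As Object

    ' Plain HTTP request, no browser involved.
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", "https://www.lego.com/hu-hu/product/around-the-world-11015", False
    http.send

    ' Load the raw HTML into a late-bound HTML document so it can be queried.
    Set Doc = CreateObject("htmlfile")
    Doc.Write http.responseText
    Doc.Close

    For Each meta In Doc.getElementsByTagName("meta")
        ' Assumption: the image URL sits in <meta property="og:image" content="...">.
        ' Appending "" turns a missing (Null) attribute into an empty string.
        If LCase$(meta.getAttribute("property") & "") = "og:image" Then
            Debug.Print meta.getAttribute("content")
            Exit For
        End If
    Next meta
End Sub
```

If nothing is printed, the assumed tag is not in the response and a different tag or approach is needed.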

DecimalTurn · Answer 1 · 2023-01-19T02:01:14.210

Need to write the response

First of all, you never write the HTML response to your htmlfile object. So you won't be able to find anything when you call the method getElementsByClassName on it.

Make sure that you include the following line before trying to use the Doc object:

Doc.Write .responseText
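
Put together, the request and the write step might look like the sketch below. It keeps the late binding and the User-Agent header from the question. The sub name PrintImageSources and the http variable are made up for illustration, and it lists every <img> by tag name, which is a simplification of the class-based lookup in the question.

```vba
Sub PrintImageSources()
    Dim http As Object
    Dim Doc As Object
    Dim img As Object
    Dim url As String

    url = "https://www.lego.com/hu-hu/product/around-the-world-11015"

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", url, False
    http.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
    http.send

    ' This is the step missing from the question:
    ' without it the document is empty and every lookup returns nothing.
    Set Doc = CreateObject("htmlfile")
    Doc.Write http.responseText
    Doc.Close

    ' Print the src of each <img> that exists in the static HTML.
    For Each img In Doc.getElementsByTagName("img")
        Debug.Print img.getAttribute("src")
    Next img
End Sub
```

This only sees images that are present in the HTML the server returns, so the output may not include the one you are after.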

Dynamic Content

Secondly, some of the content on that page is not in the original HTTP response that XMLHTTP receives. The page contains JavaScript code that loads content dynamically.

To test this in Chrome, you can open the Chrome DevTools window on that page, then disable JavaScript and refresh the page.

You'll then see the original HTML and a notification that says that JavaScript is disabled.

And now, if you search inside the Elements tab, you won't find the element you were looking for (at least I couldn't find anything with a class "cajeby").
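
The same check can be done from VBA without opening DevTools, by searching the raw response text for the class name. A rough sketch, where the sub name CheckClassInStaticHtml is hypothetical and "cajeby" is simply the class fragment from the question:

```vba
Sub CheckClassInStaticHtml()
    Dim http As Object

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", "https://www.lego.com/hu-hu/product/around-the-world-11015", False
    http.send

    ' InStr returns 0 when the text is not found anywhere in the response.
    If InStr(1, http.responseText, "cajeby", vbTextCompare) > 0 Then
        Debug.Print "Class name found in the static HTML."
    Else
        Debug.Print "Class name not found; the element is probably added by JavaScript."
    End If
End Sub
```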

Browser emulation

So, now what? Well, you'll need something that can take the original response and execute its JavaScript code, which in practice means driving a real browser. For that you could use Selenium. It's the modern way of doing web scraping or any browser automation with VBA.

You can easily find tutorials on how to get started with Selenium for VBA, but I would recommend this video by WiseOwlTutorials.

Then your code could look like this:

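    ' Requires a reference to the Selenium library in the VBA project.
    Dim Browser As New Selenium.WebDriver
    Browser.Start "chrome", "https://www.lego.com/hu-hu/product/around-the-world-11015"
    Browser.Get "/"

    ' Wait up to 5000 ms for the JavaScript-rendered <img> to appear.
    Dim img As WebElement
    Set img = Browser.FindElementByCss(".Imagestyles__Img-sc-1qqdbhr-0.cajeby", timeout:=5000)

    Debug.Print img.Attribute("src")

    ' Close the browser window and the driver process when done.
    Browser.Quit
    Set Browser = Nothing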

Some notes on the code

Make sure that you have included a reference to the Selenium Library.
Notice the use of FindElementByCss. This is necessary because you are matching on 2 class names at once and no other method currently supports that. It does mean you need to use CSS selector syntax, where each class name is prefixed with a dot and the names are chained without spaces. (More about this here).
Notice the use of timeout:=5000, which lets Selenium know that you are willing to wait up to 5000 milliseconds for the JavaScript code to load the content you are looking for. (More details here).

Thanks for the quick and detailed reply! This selenium is an interesting topic. I can see that if I want to collect data from many different sites, it might be easier to use in the long run. I will definitely check out the tutorial videos. Thank you! — hwka, Jan 19 '23 at 00:14
