0

I am trying to scrape my employers website to extract images from their Blog post en mass. I have started creating a scraping tool in Excel using VBA.

(We don't have access to the SQL database)

I have setup a work sheet that contains a list of post identifiers in column A and the URL of the post in column B.

My VBA script so far runs through the list of URL's in column B extracts the HTML from a Tag on the page by ID, using getElementById and pastes the resulting output as a string into column C.

I am now at the point where I am trying to figure out how to extract the src attribute from every image in the resulting output and paste it into the relevant columns. I can't for the life of me come up with an easy solution. I am not very familiar with RegEx and am struggling with Excel's built in string functions.

The end game is to get the macro to run through each image URL and save the image to disk with a filename format like "{Event No.}-{Image Number}".jpg

Any help would be much appreciated.

Worksheet setup

Sub Get_Image_SRC()

Dim sht As Worksheet
Dim LastRow As Long
Dim i As Integer
Dim url As String
Dim IE As Object
Dim objElement As Object
Dim objCollection As Object
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement


Set sht = ThisWorkbook.Worksheets("Sheet1")
'Ctrl + Shift + End
LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
For i = 2 To LastRow
    url = Cells(i, "C").Value
    MsgBox (url)
    IE.navigate url
    Application.StatusBar = url & " is loading..."
    Do While IE.readyState = 4: DoEvents: Loop
    Do Until IE.readyState = 4: DoEvents: Loop
    Application.StatusBar = url & " Loaded"
    If Cells(i, "B").Value = "WEBNEWS" Then
        Cells(i, "D").Value = IE.document.getElementById("NewsDetail").outerHTML
       Else
        Cells(i, "D").Value = IE.document.getElementById("ReviewContainer").outerHTML
    End If



Next i

Set IE = Nothing
Set objElement = Nothing
Set objCollection = Nothing

End Sub

Example resulting HTML:

<div id=""NewsDetail""><div class=""NewsDetailTitle"">Video: Race Face Behind the Scenes Tour</div><div class=""NewsDetailImage""><img alt=""HeadlinesThumbnail.jpg"" src=""/ImageHandler/6190/515/1000/0/""></div>    <div class=""NewsDetailBody"">Pinkbike posted this video a while ago, if you missed it, its' definitely worth a watch. 

Ken from Camp of Champions took a look at their New Westminster factory last year which gives a look at the production, people and culture of Race Face. The staff at Race Face are truly their greatest asset they had, best wishes to everyone!

<p><center><object width=""500"" height=""281""><param name=""allowFullScreen"" value=""true""><param name=""AllowScriptAccess"" value=""always""><param name=""movie"" value=""http://www.pinkbike.com/v/188244""><embed width=""500"" height=""281"" src=""http://www.pinkbike.com/v/188244"" type=""application/x-shockwave-flash"" allowscriptaccess=""always"" allowfullscreen=""true""></object></center><p></p>


</div><div class=""NewsDate"">Published Friday, 25 November 2011</div></div>"

My current references

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343

2 Answers2

1

Using VBA seems very complicated when you can do this easily with Wget: How do I use Wget to download all Images into a single Folder.

Community
  • 1
  • 1
ThomasDoe
  • 113
  • 1
  • 5
  • I only require images from a specific section of each page and require the images to be labelled in such a way that they can be matched up to specific page and placement in the text. I am also on Windows with no access to a *nix box. – Roll_Dice_Get_Data Sep 29 '16 at 09:31
0

For the regular expression method you should check out these two links:

Which basically boils down to this:

  • Regular expression to get a src attribute value from img is src\s*=\s*"(.+?)"
  • Use the VBScript.RegExp library to use regular expressions in VBA

I've used late binding but you can include the reference if you want.

Then the VBA goes like this:

Option Explicit

Sub Test()

Dim strHtml As String

' sample html, note single img tag
strHtml = ""
strHtml = strHtml & "<div id=""foo"">"
strHtml = strHtml & "<bar class=""baz"">"
strHtml = strHtml & "<img alt=""fred"" src=""\\server\path\picture1.png"" />"
strHtml = strHtml & "</bar>"
strHtml = strHtml & "<bar class=""baz"">"
strHtml = strHtml & "<img alt=""ned"" src=""\\server\path\picture2.png"" />"
strHtml = strHtml & "</bar>"
strHtml = strHtml & "<bar class=""baz"">"
strHtml = strHtml & "<img alt=""teddy"" src=""\\server\path\picture3.png"" />"
strHtml = strHtml & "</bar>"
strHtml = strHtml & "</div>"

Dim strSrc As String
Dim objRegex As Object
Dim objMatches As Object
Dim lngMatchCount As Long, lngCounter As Long

' create regex
Set objRegex = CreateObject("VBScript.RegExp")

' set pattern and execute
With objRegex
    .IgnoreCase = True
    .Pattern = "src\s*=\s*""(.+?)"""
    .Global = True

    If .Test(strHtml) Then
        Set objMatches = .Execute(strHtml)
        lngMatchCount = objMatches.Count
        For lngCounter = 0 To lngMatchCount - 1
            strSrc = objMatches(lngCounter).SubMatches(0)
            ' youve successfully captured the img src value
            Debug.Print strSrc
        Next
    Else
        strSrc = "Not found"
    End If
End With

End Sub

Note that I am getting the first item of the SubMatches collection in order to get the value of the src attribute. The difference between objMatches(0) and objMatches(0).SubMatches(0) in this code is:

src="\\server\path\picture.png"

Versus:

\\server\path\picture.png

You probably want to wrap this up as a function and call it when you get work out the value of IE.document.getElementById("NewsDetail").outerHTML in the If..End If block of your code.

Community
  • 1
  • 1
Robin Mackenzie
  • 18,801
  • 7
  • 38
  • 56