If you are open to a solution that does not use regular expressions, you can use the HTMLDocument object.
You can add a reference to the Microsoft HTML Object Library in the VBE (Tools > References) and then use early binding. Alternatively, as in my example code below, just use late binding:
    Dim objHtml As Object
    Set objHtml = CreateObject("htmlfile")
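My example creates the HTMLDocument from a string, and for that you need to use late binding according to this accepted answer: https://stackoverflow.com/questions/9995257/mshtml-createdocumentfromstring-instead-of-createdocumentfromurl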
Anyhow, you can then use the methods and properties of the HTMLDocument object to inspect the DOM. I've used getElementsByTagName, innerText and innerHTML below to get the two tags you are interested in. E.g.:
    ' we want <a> tags with no anchor text and no <img>
    For Each objElement In objElements
        ' innerText = "" means the tag has no anchor text
        If objElement.innerText = "" Then
            ' check for <img in innerHTML to skip <a> tags that contain an image
            If InStr(1, objElement.innerHTML, "<IMG", vbTextCompare) = 0 Then
                Debug.Print objElement.outerHTML
            End If
        End If
    Next objElement
Full example:
    Option Explicit

    Sub ParseATags()

        Dim strHtml As String

        ' build a small test document
        strHtml = ""
        strHtml = strHtml & "<html>"
        strHtml = strHtml & "<body>"
        ' 2 with no anchor text and no <img>
        strHtml = strHtml & "<a href=""/""><span style=""color: #000000;""></span></a>"
        strHtml = strHtml & "<a href=""/""></a>"
        ' 2 with no anchor text but with an <img>
        strHtml = strHtml & "<a href=""/"" title=""""><span style=""color: #000000;""></span><img class=""cars""></a>"
        strHtml = strHtml & "<a href=""/"" title=""""><img class=""cars""></a>"
        ' and 2 with anchor text
        strHtml = strHtml & "<a href=""/""><span style=""color: #000000;"">Cars</span></a><br>"
        strHtml = strHtml & "<a href=""/"">Cars</a><br>"
        strHtml = strHtml & "</body>"
        strHtml = strHtml & "</html>"

        ' must use late binding
        ' https://stackoverflow.com/questions/9995257/mshtml-createdocumentfromstring-instead-of-createdocumentfromurl
        Dim objHtml As Object
        Set objHtml = CreateObject("htmlfile")

        ' write the html string into the document
        With objHtml
            .Open
            .write strHtml
            .Close
        End With

        ' now parse the document
        Dim objElements As Object, objElement As Object

        ' get the <a> tags
        Set objElements = objHtml.getElementsByTagName("a")

        ' we want <a> tags with no anchor text and no <img>
        For Each objElement In objElements
            ' innerText = "" means the tag has no anchor text
            If objElement.innerText = "" Then
                ' check for <img in innerHTML to skip <a> tags that contain an image
                If InStr(1, objElement.innerHTML, "<IMG", vbTextCompare) = 0 Then
                    Debug.Print objElement.outerHTML
                End If
            End If
        Next objElement

    End Sub
Potentially you are scraping this HTML from a web page using IE automation or something similar. In that case the early-bound approach is useful, as you will get IntelliSense on the HTMLDocument object and its methods and properties.
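For what it's worth, here is a rough sketch of how the early-bound version might look in that scenario. It is not tested against your page and rests on a few assumptions: references to both "Microsoft HTML Object Library" and "Microsoft Internet Controls" are set, Internet Explorer automation is still available on your machine, and the URL is just a placeholder. The Sub name is made up for illustration.

```vba
Option Explicit

' Sketch only. Assumes references (Tools > References) to:
'   - Microsoft HTML Object Library (MSHTML)
'   - Microsoft Internet Controls (SHDocVw)
Sub ParseATagsFromPage()

    Dim objIE As SHDocVw.InternetExplorer
    Dim objDoc As MSHTML.HTMLDocument
    Dim objElements As MSHTML.IHTMLElementCollection
    Dim objElement As MSHTML.IHTMLElement

    Set objIE = New SHDocVw.InternetExplorer
    objIE.Visible = False
    objIE.navigate "https://example.com/"   ' placeholder URL (assumption)

    ' wait until the page has finished loading
    Do While objIE.Busy Or objIE.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop

    ' the document comes from the browser, so no string write is needed
    Set objDoc = objIE.document
    Set objElements = objDoc.getElementsByTagName("a")

    ' same filter as above: no anchor text and no <img>
    For Each objElement In objElements
        If objElement.innerText = "" Then
            If InStr(1, objElement.innerHTML, "<IMG", vbTextCompare) = 0 Then
                Debug.Print objElement.outerHTML
            End If
        End If
    Next objElement

    objIE.Quit

End Sub
```

The filtering loop is the same as in the late-bound example. The only real difference is that the variables are typed, so the VBE can offer member lists as you type.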
I appreciate that my comment (with the SO-classic answer about parsing HTML with regex) may have seemed churlish. However, parsing HTML with regex is fraught with difficulty and quite often simply an exercise in futility.
Hoping this approach gives you another option if you wish to go down that path.