1

I'm trying to parse the html texts in Excel cells and delete some parts. That texts can contain the different span styles, URLs, classes. I guess the easiest way to do it is RegEx.

I've got the six types of the links (For example. Of course they may have the different attributes and values):

2 without anchors and without <img>(should be selected)

<a href="/"><span style="color: #000000;"></span></a>
<a href="/"></a>

2 without anchors and with <img>(should not be selected)

<a href="/" title=""><span style="color: #000000;"></span><img class="cars"></a>
<a href="/" title=""><img class="cars"></a>

and 2 with anchors (should not be selected)

<a href="/"><span style="color: #000000;">Cars</span></a>
<a href="/">Cars</a>

What RegEx pattern I should use to mark the 2 links without anchors and without <img> only?

I've build the pattern

<a href=".*">(?!<img ".*">)(?:<\/span>)?<\/a>

that marks two types of the links:

<a href="/" title=""><span style="color: #0000;"></span><img class="cars"></a>
<a href="/" title=""><img class="cars"></a>

that contain <img> tag.

But if remove the quotes in <img> tag:

<a href="/" title=""><img class=cars></a>

it works correctly.

VBA Code:

Public Function txtrpl(ByRef x As String) As String`<br>

    With CreateObject("VBScript.RegExp")`<br>
        .Global = True`<br>
        .Pattern = "<a href="".*"">(?!<img "".*"">)(?:<\/span>)?<\/a>"`<br>
        txtrpl= Trim$(.Replace(x, ""))`<br>
    End With

End Function
Community
  • 1
  • 1
FL.Alx
  • 71
  • 1
  • 6
  • 1
    Must... not... post... THAT... link... – Mr Lister Nov 04 '17 at 08:04
  • Let me do it for you @MrLister - please see this question: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Robin Mackenzie Nov 04 '17 at 08:07
  • No code? No examples of desired output? Please familiarize with how this site works by taking the [Two Minute Tour](https://stackoverflow.com/tour). Then read the HELP pages for information as to [What topics can I ask about here?](http://stackoverflow.com/help/on-topic), [How to Ask a Good Question](http://stackoverflow.com/help/how-to-ask), and [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) – Ron Rosenfeld Nov 04 '17 at 10:11
  • VBA code added. – FL.Alx Nov 04 '17 at 10:48
  • @FL.Alx Pardon my ignorance of HTML, but which parts, exactly, in your third example, comprise the **Anchor** – Ron Rosenfeld Nov 04 '17 at 12:54
  • @RonRosenfeld: "Cars". – FL.Alx Nov 04 '17 at 13:07

1 Answers1

1

If you will consider a solution without using regular expressions then you can use the HTMLDocument object.

You can add a reference (Microsoft HTML Object Library) in the VBE to get this library and then use early binding. Or, for my example code below, just use late binding with:

Dim objHtml As Object
Set objHtml = CreateObject("htmlfile")

My example passes a string to create the HTMLDocument and you need to use late binding according this accepted answer.

Anyhow, you can then use methods and properties of the HTMLDocument object to inspect the DOM - I've used getElementsByTagName, innerText and innerHTML below to get the two tags you are interested in. E.g.:

' we want a tags without anchors and without img
For Each objElement In objElements
    ' innerText = "" is no anchor
    If objElement.innerText = "" Then
        ' check for <img in innerHtml to avoid a tags with an image
        If InStr(1, objElement.innerHtml, "<IMG", vbTextCompare) = 0 Then
            Debug.Print objElement.outerHTML
        End If
    End If
Next objElement

Full example:

Option Explicit

Sub ParseATags()

    Dim strHtml As String

    strHtml = ""
    strHtml = strHtml & "<html>"
    strHtml = strHtml & "<body>"
    ' 2 without anchors and without <img>
    strHtml = strHtml & "<a href=""/""><span style=""color: #000000;""></span></a>"
    strHtml = strHtml & "<a href=""/""></a>"
    ' 2 without anchors and with <img>
    strHtml = strHtml & "<a href=""/"" title=""""><span style=""color: #000000;""></span><img class=""cars""></a>"
    strHtml = strHtml & "<a href=""/"" title=""""><img class=""cars""></a>"
    ' and 2 with anchors
    strHtml = strHtml & "<a href=""/""><span style=""color: #000000;"">Cars</span></a><br>"
    strHtml = strHtml & "<a href=""/"">Cars</a><br>"
    strHtml = strHtml & "</body>"
    strHtml = strHtml & "</html>"

    ' must use late binding
    ' https://stackoverflow.com/questions/9995257/mshtml-createdocumentfromstring-instead-of-createdocumentfromurl
    Dim objHtml As Object
    Set objHtml = CreateObject("htmlfile")

    ' add html
    With objHtml
        .Open
        .write strHtml
        .Close
    End With

    ' now parse the document
    Dim objElements As Object, objElement As Object

    ' get the <a> tags
    Set objElements = objHtml.getElementsByTagName("a")

    ' we want a tags without anchors and without img
    For Each objElement In objElements
        ' innerText = "" is no anchor
        If objElement.innerText = "" Then
            ' check for <img in innerHtml to avoid a tags with an image
            If InStr(1, objElement.innerHtml, "<IMG", vbTextCompare) = 0 Then
                Debug.Print objElement.outerHTML
            End If
        End If
    Next objElement

End Sub

Potentially you are scraping this HTML from a webpage using IE automation or something. In this case, it is useful to use the early-bound approach as you will get intellisense on the HTMLDocument object and the methods etc.

I appreciate that my comment (with the SO-classic answer about parsing HTML with regex) may have seemed churlish. However, it is fraught with difficulty and quite often simply an exercise in futility.

Hoping this approach gives you another option if you wish to go down that path.

Robin Mackenzie
  • 18,801
  • 7
  • 38
  • 56
  • I found some nuances in this method: 1 - it changes the case (from lower to upper) of the tags (needs case insensitive replace), 2 - it deletes the " " value (needs to replace "" with " " to found results in the source strings. – FL.Alx Nov 05 '17 at 09:25