
I have a question about extracting text from an HTML page using regular expressions. The regex I used is supposed to extract the text from all four spans, but it's not working. Please look at the code I tried and the HTML I want to extract from.

HTML

<div class="content-wrapper">

    <a class="klose"href="https://www.anysiteAtall.com">
        <span class="title">The good big book</span>
        <span id="place" class="country">America</span>
        <span class="price">$300</span>
        <span class="color">white</span>
    </a>
</div>

MY CODE

   Dim span_matchsingle As New Regex(
       "<span[^<>]*class=""color""[^<>]*>(?<meTIT>.*?)</span>" & _
       "<span[^<>]*class=""title""[^<>]*>(?<destn>.*?)</span>" & _
       "<span[^<>]*class=""country""[^<>]*>(?<AtG>.*?)</span>" & _
       "<span[^<>]*class=""price""[^<>]*>(?<meVIEW>.*?)</span>")


   Dim matches As MatchCollection = span_matchsingle.Matches(Me.TextBox1.Text, RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

    For Each m As Match In matches


        Dim actualD As String = m.Groups("meTIT").Value
        Dim actss As String = m.Groups("AtG").Value
        Dim actunm As String = m.Groups("destn").Value
        Dim actualzx As String = m.Groups("meVIEW").Value

        'pass them all into the listview

        Dim lvi As New ListViewItem
        lvi.Text = actualD
       lvi.SubItems.Add(actss)
        lvi.SubItems.Add(actunm)
        lvi.SubItems.Add(actualzx)
        Me.ListView1.Items.Add(lvi)

       '''''''''''''''''''''''''''''''''''''''''
        '''''''''''''''''''''''''''''''''''''''''

    Next

This is the code I tried, but it does not extract the inner text from the spans unless I include just one span in the regex, and that is not what I want.

Regular Jo

1 Answer


Please understand, there are some people here who are great at regex, but relying on regex to parse HTML can become a very frustrating experience. Many of us love regex and make capturing groups in our Alphabits cereal (you can splice in some Cheerios you've bitten in half for the parentheses), but HTML is one job regex is not suited for. People don't say "Don't use regex" to dodge helping; they say it because using a proper tool for the task is helping you.

Here's why you're getting the response of "Don't use regex to parse HTML".

<span[\s\S]*?>[\s\S]*?</span>

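Will match what you want.

Unless there's a nested span, like <span><span></span></span>

<span[\s\S]*?>[\s\S]*</span>

Will handle that nesting, because the greedy [\s\S]* runs all the way to the last </span>.

Unless there are two spans side by side, like <span></span><span></span>

The second regex won't match those separately, because the greedy quantifier consumes everything up to the last </span> and returns both spans as a single match. And on the nested example, the first regex stops at the first </span>, so it matches only <span><span></span>.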
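To see the two failure modes side by side, here is a minimal VB.NET sketch. The input strings are made up for illustration (they are not from the question's page), and `FirstMatch` is a hypothetical helper, not something from the question's code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyVersusGreedy
    ' The two patterns discussed above, copied verbatim.
    Public Const LazyPattern As String = "<span[\s\S]*?>[\s\S]*?</span>"
    Public Const GreedyPattern As String = "<span[\s\S]*?>[\s\S]*</span>"

    ' Hypothetical helper: returns the text of the first match,
    ' or an empty string when the pattern does not match at all.
    Function FirstMatch(pattern As String, input As String) As String
        Dim m As Match = Regex.Match(input, pattern)
        Return If(m.Success, m.Value, String.Empty)
    End Function

    Sub Main()
        ' Illustrative inputs (assumptions, not taken from the question).
        Dim nested As String = "<span>a<span>b</span>c</span>"
        Dim siblings As String = "<span>a</span><span>b</span>"

        ' Lazy stops at the first </span>, cutting the outer span short.
        Console.WriteLine(FirstMatch(LazyPattern, nested))     ' <span>a<span>b</span>
        ' Greedy runs to the last </span>, so the nested case comes out whole.
        Console.WriteLine(FirstMatch(GreedyPattern, nested))   ' <span>a<span>b</span>c</span>

        ' Lazy handles siblings one at a time.
        Console.WriteLine(FirstMatch(LazyPattern, siblings))   ' <span>a</span>
        ' Greedy swallows both siblings as a single match.
        Console.WriteLine(FirstMatch(GreedyPattern, siblings)) ' <span>a</span><span>b</span>
    End Sub
End Module
```

Neither pattern gets both inputs right, which is the whole problem: the regex has no idea which </span> belongs to which <span>.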

Now sure, you can use alternation to cover various nesting patterns but it becomes slower, monstrous to read, hard to modify, and a whole lot of other headaches.

Further, these make no allowance for a > inside an attribute of the span tag, but that's workable:

<span(\s*\w+="[^"]*")+>...

But then you have to consider quoting styles.

<span(\s*\w+=(?:(["'])?(.*?)\2))+>

And even then, you still have to consider nested quotes.
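As a rough illustration of how each attribute pattern fixes one case and trips on the next, here is a small VB.NET sketch. The `(?<inner>...)</span>` tail is my addition so there is something to print, the test strings are made up, and `InnerOf` is a hypothetical helper.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AttributePatterns
    ' Naive tag pattern: the tag is assumed to end at the first ">".
    Public Const Naive As String = "<span[\s\S]*?>(?<inner>[\s\S]*?)</span>"
    ' Attribute-aware, but only for double-quoted values.
    Public Const DoubleQuoted As String = "<span(\s*\w+=""[^""]*"")+>(?<inner>[\s\S]*?)</span>"
    ' Either quote style; \2 refers back to the opening quote character.
    Public Const AnyQuote As String = "<span(\s*\w+=(?:([""'])?(.*?)\2))+>(?<inner>[\s\S]*?)</span>"

    ' Hypothetical helper: returns the captured span content,
    ' or an empty string when the pattern does not match.
    Function InnerOf(pattern As String, input As String) As String
        Dim m As Match = Regex.Match(input, pattern)
        Return If(m.Success, m.Groups("inner").Value, String.Empty)
    End Function

    Sub Main()
        ' Illustrative inputs (assumptions, not taken from the question).
        Dim gtInAttribute As String = "<span title=""a > b"">text</span>"
        Dim singleQuoted As String = "<span class='title'>book</span>"

        ' Naive: the ">" inside the attribute ends the tag early,
        ' so the capture begins with the rest of the attribute value.
        Console.WriteLine("[" & InnerOf(Naive, gtInAttribute) & "]")

        ' Double-quoted pattern: gets "text", but finds nothing once
        ' the page uses single quotes.
        Console.WriteLine("[" & InnerOf(DoubleQuoted, gtInAttribute) & "]")
        Console.WriteLine("[" & InnerOf(DoubleQuoted, singleQuoted) & "]")

        ' Any-quote pattern: copes with both of these inputs.
        Console.WriteLine("[" & InnerOf(AnyQuote, gtInAttribute) & "]")
        Console.WriteLine("[" & InnerOf(AnyQuote, singleQuoted) & "]")
    End Sub
End Module
```

Each step handles exactly one more input shape than the last, and real pages always seem to have one more shape waiting. That is the treadmill a proper HTML parser gets you off of.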

Regular Jo