Using RegEx to grab all headings to build a ToC (Classic ASP)

Question

I'm still trying to develop a function that extracts all headings (h1, h2, h3, ...) that have an id from an HTML string, so that I can build a table of contents.

I've written a simple script using a regex, but for some strange reason it collects only one match (the last one).

Here is my sample code:

Function RegExResults(strTarget, strPattern)
    dim regEx
    Set regEx = New RegExp
    regEx.Pattern = strPattern
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Multiline = True
    Set RegExResults = regEx.Execute(strTarget)
    Set regEx = Nothing
End Function

htmlstr = "<h1>Documentation</h1><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p><h3 id=""one"">How do you smurf a murf?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper.</p><h3 id=""two"">How do many licks does a giraffe?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p>"

regpattern = "<h([1-9]).*id=\""(.*)\"">(.*)</h[1-9]>"

set arrayresult = RegExResults(htmlstr,regpattern) 
For each result in arrayresult
    response.write "count: " & arrayresult.count & "<br><hr>"
    response.write "0: " & result.Submatches(0) & "<br>"
    response.write "1: " & result.Submatches(1) & "<br>"
    response.write "2: " & result.Submatches(2) & "<br>"
Next

I need to extract all headings and, for each one, know its level (1..9) and its id value, which I will use to jump to the right section (#ID_value).

I hope someone can help me find out why this is not working as intended.

Thank you

score 1 · Accepted Answer · edited May 23 '17 at 12:32

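The .* quantifiers in your pattern are greedy: each one consumes as much text as it can, so everything from the first heading to the last closing tag ends up as a single match. To collect every heading separately you need lazy quantifiers, so use .*? instead.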

With some improvements, the pattern could be something like below.

regpattern = "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>" 

' \1 is a backreference: it matches the same text as the 1st group (e.g. h3)
' the backslash (\) is not needed before the doubled double quotes, so it was removed
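
As a rough illustration of how the lazy pattern could feed a table of contents, here is a minimal sketch. The function name BuildToc, the sample string, and the level CSS class are hypothetical and not part of the question. It assumes each heading sits on a single line (in VBScript the dot does not match line breaks) and that the heading text is safe to write back out as-is.

```vbscript
' Hypothetical helper: builds a <ul> that links to every heading carrying an id.
Function BuildToc(strHtml)
    Dim regEx, matches, m, level, toc
    Set regEx = New RegExp
    regEx.Pattern = "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>"
    regEx.Global = True        ' return every match, not just the first
    regEx.IgnoreCase = True
    Set matches = regEx.Execute(strHtml)

    toc = "<ul>"
    For Each m In matches
        ' SubMatches(0) = tag name (e.g. "h3"), (1) = id, (2) = heading text
        level = Mid(m.SubMatches(0), 2)   ' drop the leading "h" to get the level
        toc = toc & "<li class=""level" & level & """>" & _
              "<a href=""#" & m.SubMatches(1) & """>" & m.SubMatches(2) & "</a></li>"
    Next
    toc = toc & "</ul>"

    Set regEx = Nothing
    BuildToc = toc
End Function

' Sample input (an assumption, shortened from the question's string)
Dim sampleHtml
sampleHtml = "<h1>Documentation</h1><p>Intro</p>" & _
             "<h3 id=""one"">First</h3><p>Text</p><h3 id=""two"">Second</h3>"

Response.Write BuildToc(sampleHtml)
' Writes: <ul><li class="level3"><a href="#one">First</a></li><li class="level3"><a href="#two">Second</a></li></ul>
```

The h1 without an id is skipped here because the \1 backreference requires a matching closing h1 tag after the id, and none exists in the sample. Less regular markup may not be filtered this cleanly.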

I'd strongly recommend having a look at Repetition with Star and Plus. It's a very useful article for understanding lazy and greedy repetition in regex.

Oh, I almost forgot: you can't parse HTML with regex, or at least you shouldn't.
