Parsing HTML in VBA
Granted this parsing routine doesn't do exactly what you're looking for out of the box, it should get you going in the right direction in VBA.
'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library
Sub Extract_TD_text()
Dim URL As String
Dim IE As InternetExplorer
Dim HTMLdoc As HTMLDocument
Dim TDelements As IHTMLElementCollection
Dim TDelement As HTMLTableCell
Dim r As Long
'Saved from www vbaexpress com/forum/forumdisplay.php?f=17
URL = "file://C:\VBAExpress_Excel_Forum.html"
Set IE = New InternetExplorer
With IE
.navigate URL
.Visible = True
'Wait for page to load
While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
Set HTMLdoc = .document
End With
Set TDelements = HTMLdoc.getElementsByTagName("TD")
Sheet1.Cells.ClearContents
r = 0
For Each TDelement In TDelements
'Look for required TD elements - this check is specific to VBA Express forum - modify as required
If TDelement.className = "alt2" And TDelement.Align = "center" Then
Sheet1.Range("A1").Offset(r, 0).Value = TDelement.innerText
r = r + 1
End If
Next
End Sub
Doing it with Regex
It's not advisable to use a regex to parse HTML due to all the possible obscure edge cases that can crop up, but it seems that you have some control over the HTML so you should able to avoid many of the edge cases the regex police cry about.
Description
This regex will do the following:
- parse your sample text into individual rows
- collect the row number
- collect the two plain text values
- avoid many of the obscure edge cases that make parsing html with regex difficult
The Regex
<tr\s
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)row_([0-9]+)\1(?:\s|>))
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
[^<]*</tr>

Notes: for this regex you'll want to use the following flags: Ignore white space, case insensitive, and dots match all characters. To see the image better you can right click and select show in new window.
Exmaple
Given your sample text
<tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
The regex will create the following capture groups
- capture group 0 gets the entire row
- capture group 1 gets the quote around the row number in the row's id attribute
- capture group 2 gets the row number
- capture group 3 gets the first table cell value
- capture group 4 gets the second table cell value
And the following matches:
[0][0] = <tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
[0][1] = "
[0][2] = 65
[0][3] = Shipment's
weight
[0][4] = 40120
[1][0] = <tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
[1][1] = "
[1][2] = 116
[1][3] = KG
or LBS
[1][4] = LBS
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<tr '<tr'
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
id= 'id='
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
row_ 'row_'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[0-9]+ any character of: '0' to '9' (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
td 'td'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
p 'p'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
</td> '</td>'
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
td 'td'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
p 'p'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
</td> '</td>'
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
</tr> '</tr>'