0

I want to extract data from HTML file with regex expression but I have no idea what pattern should I use. The html code comes from an email.

The following is part of the html code. I want to be able to get "40120 LBS".

What would the pattern looks like?

I thought about something like: Shipment's weight [any char] [0-9][0-9][0-9][0-9][0-9]

..etc

Maybe you know something more efficient to achieve what I want. Thank you.

<tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>
  • 1
    You don't want to [use regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for this. Find an html-parsing library. – Blorgbeard May 12 '16 at 22:34
  • It's not clear what exactly you need to get out of the html – Laurel May 12 '16 at 22:57

2 Answers2

2

Rather than using RegExp to parse the HTML file, use a DOM parser.

The most straightforward way to do it is to add a reference to Microsoft HTML Object Library and use it. Getting to know the objects can be a little tricky, but not as tricky as trying to handle HTML using regular expressions!

The key is determining what rule you want to use to extract the value.

Here's an example that (hopefully) demonstrates the technique.

Public Sub SimpleParser()
  Dim doc As MSHTML.HTMLDocument
  Dim b As MSHTML.HTMLBody
  Dim tr As MSHTML.HTMLTableRow, td As MSHTML.HTMLTableCell
  Dim columnNumber As Long, rowNumber As Long
  Dim trCells As MSHTML.IHTMLElementCollection
  Set doc = New MSHTML.HTMLDocument
  doc.body.innerHTML = "<table><tr style='mso-yfti-irow:8' id=""row_65""> <td width=170 valign=top style='width:127.5pt;background:white; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>Shipment's weight<o:p></o:p></span></p> </td> <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>40120<o:p></o:p></span></p> </td> </tr> <tr style='mso-yfti-irow:9' id=""row_116""> <td width=170 valign=top style='width:127.5pt;background:#F3F3F3; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>KG or LBS<o:p></o:p></span></p> </td> <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>LBS<o:p></o:p></span></p> </td> </tr></table>"
  Set b = doc.body
  'Example of looping through elements
  For Each tr In b.getElementsByTagName("tr")
    rowNumber = rowNumber + 1
    columnNumber = 0
    For Each td In tr.getElementsByTagName("td")
      columnNumber = columnNumber + 1
      Debug.Print rowNumber & "," & columnNumber, td.innerText
    Next
  Next
  'Go through each row; if the first cell is "Shipment's weight", display the next cell.
  For Each tr In b.getElementsByTagName("tr")
    Set trCells = tr.getElementsByTagName("td")
    If trCells.Item(0).innerText = "Shipment's weight" Then Debug.Print "Weight: " & trCells.Item(1).innerText
  Next

End Sub
C. White
  • 802
  • 1
  • 7
  • 19
1

Parsing HTML in VBA

Granted this parsing routine doesn't do exactly what you're looking for out of the box, it should get you going in the right direction in VBA.

 'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library
 
Sub Extract_TD_text() 
     
    Dim URL As String 
    Dim IE As InternetExplorer 
    Dim HTMLdoc As HTMLDocument 
    Dim TDelements As IHTMLElementCollection 
    Dim TDelement As HTMLTableCell 
    Dim r As Long 
     
     'Saved from www vbaexpress com/forum/forumdisplay.php?f=17
    URL = "file://C:\VBAExpress_Excel_Forum.html" 
     
    Set IE = New InternetExplorer 
     
    With IE 
        .navigate URL 
        .Visible = True 
         
         'Wait for page to load
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend 
             
            Set HTMLdoc = .document 
        End With 
         
        Set TDelements = HTMLdoc.getElementsByTagName("TD") 
         
        Sheet1.Cells.ClearContents 
         
        r = 0 
        For Each TDelement In TDelements 
             'Look for required TD elements - this check is specific to VBA Express forum - modify as required
            If TDelement.className = "alt2" And TDelement.Align = "center" Then 
                Sheet1.Range("A1").Offset(r, 0).Value = TDelement.innerText 
                r = r + 1 
            End If 
        Next 
         
    End Sub 

Doing it with Regex

It's not advisable to use a regex to parse HTML due to all the possible obscure edge cases that can crop up, but it seems that you have some control over the HTML so you should able to avoid many of the edge cases the regex police cry about.

Description

This regex will do the following:

  • parse your sample text into individual rows
  • collect the row number
  • collect the two plain text values
  • avoid many of the obscure edge cases that make parsing html with regex difficult

The Regex

<tr\s
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)row_([0-9]+)\1(?:\s|>))
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
[^<]*</tr>

Regular expression visualization

Notes: for this regex you'll want to use the following flags: Ignore white space, case insensitive, and dots match all characters. To see the image better you can right click and select show in new window.

Exmaple

Given your sample text

<tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>

The regex will create the following capture groups

  • capture group 0 gets the entire row
  • capture group 1 gets the quote around the row number in the row's id attribute
  • capture group 2 gets the row number
  • capture group 3 gets the first table cell value
  • capture group 4 gets the second table cell value

And the following matches:

[0][0] = <tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
[0][1] = "
[0][2] = 65
[0][3] = Shipment's
  weight
[0][4] = 40120

[1][0] = <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>
[1][1] = "
[1][2] = 116
[1][3] = KG
  or LBS
[1][4] = LBS

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <tr                      '<tr'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    id=                      'id='
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      ['"]?                    any character of: ''', '"' (optional
                               (matching the most amount possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    row_                     'row_'
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      [0-9]+                   any character of: '0' to '9' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    \1                       what was matched by capture \1
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      td                       'td'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      p                        'p'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  </td>                    '</td>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      td                       'td'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      p                        'p'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  </td>                    '</td>'
----------------------------------------------------------------------
  [^<]*                    any character except: '<' (0 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  </tr>                    '</tr>'
Community
  • 1
  • 1
Ro Yo Mi
  • 14,790
  • 5
  • 35
  • 43
  • 1
    Thanks so much for the feedback. I promise, I'll find something more efficient to do what I want. – Mr.Bi0hazard May 13 '16 at 01:23
  • Improving efficiency is the joy of programming. good luck with your efforts, and if this or another answer helps you out, please mark it as accepted. – Ro Yo Mi May 13 '16 at 01:32
  • 1
    The RegEx and explanation are pretty inspired. If possible, though, I'd skip using the Internet Explorer objects; if you must load from files, I'd use MSXML. – C. White May 13 '16 at 01:33
  • 1
    Agreed, loading an IE object is painful, however I figured the OP was asking about parsing HTML they already had, so it would be an ok example. – Ro Yo Mi May 13 '16 at 01:36
  • You are correct. I'm saving an outlook email as HTML and then will load the file into VBA. – Mr.Bi0hazard May 14 '16 at 18:35