0

I am trying to write code that will sift through this DOM structure:

<html>
 <head>
  <body>
   <table id="the-table" border="1">
    <thead>
    <tbody>
     <tr> </tr>
     <tr>
      <td class="x-grid3-hd-inner" bgcolor="#8dd5e7" colspan="7">
     </tr>
     <tr>
      <td class="x-grid3-hd-inner" bgcolor="#8dd5e7" colspan="7">
     </tr>
     <tr>
     <tr>
      <td class="oneline">2</td>
      <td class="oneline">ENB</td>
      <td class="oneline">2</td>
      <td class="oneline">CELL_99</td>
      <td class="oneline">255.255.255.0</td>
      <td class="oneline">My Group</td>
      <td class="oneline">*</td>
     </tr>
     <tr>
     <tr>
     <tr>
     ...
     <tr>
     <tr>
   </tbody>
  </table>
 </body>
</html>

I am trying to extract the text at each td element for all tr elements of the table. I expanded one example out above. All td elements of the table are formatted using the same html structure (besides the title of the table). This is the method that I have used so far.

Sub ParseWebPage(url As String, sheet As String, searchCrit As String)
    Dim objXML As MSXML2.DOMDocument
    Set objXML = New MSXML2.DOMDocument
    Set htm = CreateObject("htmlFile")
    With CreateObject("msxml2.xmlhttp")
      .Open "GET", url, False
      .send
      xmlresp = .responseText
    End With
    objXML.loadXML (xmlresp)
    Dim objElem As MSXML2.IXMLDOMElement
    Debug.Print xmlresp

    objXML.loadXML (xmlresp)
    Set objElem = objXML.selectSingleNode("tr")
    Debug.Print "Found" & objElem.text
End Sub

The problem is, every time my objElem returns back empty. I also tried using a NodeList instead of IXMLDOMElement but it always returned empty.

I believe the issue to be the string argument. I have tried using "tr", "oneline", "/html/body/table/tbody", and creating a loop for each "/html/body/table/tbody/tr[x]/td[y]" but none of these were effective.

Can someone help me out here?

noc_coder
  • 349
  • 1
  • 15
  • Your general approach is correct. I think the issue is non-well-formed XML (which HTML from the web often is). I see a lot of `` tags, but almost no `` closing tags. MSXML expects perfect XML, so empty ones should be ``. Also check `objXML` with `If objXML Is Nothing Then ...` - because you are not catching XML parse errors. If it is Nothing - then there was an error while parsing. Check this guide on how to do it properly [A Beginner's Guide to the XML DOM](https://msdn.microsoft.com/en-us/library/aa468547.aspx) – Logan Reed Dec 22 '16 at 19:15
  • @LoganReed The website is already fully formed and operable and contains all the closing tags, it was possibly a copying error that occurred. I will look at this link and see if that helps. – noc_coder Dec 22 '16 at 19:31
  • Try first with a very small well-formed XML example and then keep extending it until you find the error then. – Logan Reed Dec 22 '16 at 19:49
  • @LoganReed - Arent I going in the wrong direction here since Im trying to parse HTML with XML functions? http://stackoverflow.com/questions/2515097/selectnodes-and-getelementsbytagname – noc_coder Dec 22 '16 at 20:23
  • Well, it depends. If you are writing a general web scraper that will encounter mostly crappy HTML, then yes - this is a wrong way to go. In that case I'd probably try to use and manipulate IE object instead (as HTML parsing will be done by the browser). If you are reading HTML from one website and it is well-formed or it is not, but you can fix it to be well formed -- then XML parser is the way to go as it is easier and faster. – Logan Reed Jan 03 '17 at 15:06

3 Answers3

1
Sub test()
  Dim objList As MSXML2.IXMLDOMNodeList
  Dim objxml As New MSXML2.DOMDocument
  Dim i As Integer

  objxml.Load ("C:\test.xml") 'used load, loadXML would be correct for your use
  Set objList = objxml.SelectNodes("//tr/td")
  For i = 0 To objList.Length - 1
      Debug.Print objList.Item(i).Text
  Next i
End Sub

Used the above code on the following:

<html>
 <head>
  <body>
   <table id="the-table" border="1">
    <thead>
     <tbody>
      <tr>
      <td class="oneline">2</td>
      <td class="oneline">ENB</td>
      <td class="oneline">2</td>
      <td class="oneline">CELL_99</td>
      <td class="oneline">255.255.255.0</td>
      <td class="oneline">My Group</td>
      <td class="oneline">*</td>
     </tr>
   </tbody>
   </thead>
  </table>
 </body>
 </head>
</html>

Had the input file saved as a .xml. I got the desired results from this. This lets me believe that one of the following are happening:

  1. Your earlier xmlresp is not well formed. Can you check it or export objxml to see if it formed correctly?
  2. Your input string is too large for vba's msxml2. I once had this happen to me where xfdf data from adobe exceeded some maximum string length which lead to the input not being formed properly. When I ran a XSL outside of vba or removed the field with the long strings, it worked.
  3. There is a difference between how XML and HTML are treated. I am not that familiar with HTML, so can't comment on that part of your code too much
  • I check on number 2. There is a crap ton, and I do mean that, of white space in the html. It looks awful in the source. – noc_coder Dec 22 '16 at 22:58
0

If you are trying to get all of the texts ("2", "ENB" etc.) in the elements of td, try the following:

Dim objList As MSXML2.IXMLDOMNodeList
Set objList = objXML.SelectNodes("//tr/td")
For i = 0 To objList.Length - 1
   Debug.Print objList.Item(i)
Next i

Hope that works. Didn't have time to test, but copied from similar working code I have.

  • The list of nodes is still empty. It seems like using selectNodes/selectSingleNode is for xml only.. Im trying to reduce it down to a smaller test case now. – noc_coder Dec 22 '16 at 20:16
0
Sub ParseWebPage(url As String, sheet As String, searchCrit As String)
 Dim objXML As MSXML2.DOMDocument
 Set objXML = New MSXML2.DOMDocument
 Set htm = CreateObject("htmlFile")
 Dim tableData() As String
 Dim openPos, closePos As Integer
 Dim midPart As String

 With CreateObject("msxml2.xmlhttp")
    .Open "GET", url, False
    .send
    xmlresp = .responseText
 End With
 objXML.loadXML (xmlresp)
 tableData = Split(xmlresp, searchCrit)

 For i = 12 To UBound(tableData) - 1
    openPos = InStr(tableData(i), Chr(34) & ">")
    closePos = InStr(tableData(i), "</td>")
    midPart = mid(tableData(i), openPos + 2, closePos - openPos - 2)
    Debug.Print midPart
 Next i
End Sub

This is an ok solution for now. But, I will continue to do more research on the topic. I ended up not using the XML libraries at all.

noc_coder
  • 349
  • 1
  • 15