3

Problem Description:

I am looking for a way to access the li-elements between two specific heading-tags only (e.g.from 2nd h3 to 3rd h3 or from 3rd h3 to next h4) in order to create a table of historical events listed on https://de.wikipedia.org/wiki/1._Januar structured along the criteria mentioned in the headings. A major problem (for me ...) is that - other than the h1-heading - the subtitles of the lower levels have no className or id.

Sample of HTML:

<div class="mw-parser-output">
  [...]
  </h3>
  <ul>
    <li><a href="/wiki/153_v._Chr." title="153 v. Chr.">153 v. Chr.</a>: Die <a href="/wiki/Consulat" title="Consulat">Konsuln</a> der <a href="/wiki/R%C3%B6mische_Republik" title="Römische Republik">römischen Republik</a> beginnen ihre Amtszeit erstmals
      am 1. Januar statt am 1. März; daher ist der 1. Januar heute der Jahresanfang.</li>
    <li><span style="visibility:hidden;">0</span><a href="/wiki/45_v._Chr." title="45 v. Chr.">45 v. Chr.</a>: <a href="/wiki/Kalenderreform_des_Gaius_Iulius_Caesar" title="Kalenderreform des Gaius Iulius Caesar">Caesars Reform</a> des <a href="/wiki/R%C3%B6mischer_Kalender"
        title="Römischer Kalender">römischen Kalenders</a> endet. Dieser wird ab 2. Januar 709 <a href="/wiki/Ab_urbe_condita_(Chronologie)" title="Ab urbe condita (Chronologie)">a. u. c.</a> durch den <a href="/wiki/Julianischer_Kalender" title="Julianischer Kalender">julianischen Kalender</a>      ersetzt.</li>
    <li><span style="visibility:hidden;">0</span><a href="/wiki/32_v._Chr." title="32 v. Chr.">32 v. Chr.</a>: <a href="/wiki/Augustus" title="Augustus">Oktavian</a> lässt sich vom <a href="/wiki/R%C3%B6mischer_Senat" title="Römischer Senat">Senat</a> zum
      „Führer Italiens“ (<i><a href="/wiki/Dux_(Titel)" title="Dux (Titel)">dux Italiae</a></i>) ausrufen. Er erklärt <a href="/wiki/Kleopatra_VII." title="Kleopatra VII.">Kleopatra</a> und damit <i><a href="/wiki/De_jure/de_facto" title="De jure/de facto">de facto</a></i>      auch <a href="/wiki/Marcus_Antonius" title="Marcus Antonius">Marcus Antonius</a> den Krieg.</li>
  </ul>
  [...]
  </ul>
  <h4><span id="Inkrafttreten_von_Gesetzen_und_Staatsvertr.C3.A4gen"></span><span class="mw-headline" id="Inkrafttreten_von_Gesetzen_und_Staatsverträgen">Inkrafttreten von Gesetzen und Staatsverträgen</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>
    <a href="/w/index.php?title=1._Januar&amp;veaction=edit&amp;section=3" class="mw-editsection-visualeditor" title="Abschnitt bearbeiten: Inkrafttreten von Gesetzen und Staatsverträgen">Bearbeiten</a><span class="mw-editsection-divider"> | </span>
    <a href="/w/index.php?title=1._Januar&amp;action=edit&amp;section=3" title="Abschnitt bearbeiten: Inkrafttreten von Gesetzen und Staatsverträgen">Quelltext bearbeiten</a><span class="mw-editsection-bracket">]</span></span>
  </h4>
  <p><i>Der 1. Januar wird oft für das Inkrafttreten von Gesetzen und Staatsverträgen verwendet. Das gilt unter anderem für:</i>
  </p>
  <ul>
    <li><a href="/wiki/1812" title="1812">1812</a>: das <i><a href="/wiki/Allgemeines_b%C3%BCrgerliches_Gesetzbuch" title="Allgemeines bürgerliches Gesetzbuch">Allgemeine bürgerliche Gesetzbuch</a></i> <i>(ABGB)</i> in den <a href="/wiki/Habsburgermonarchie#Erblande"
        title="Habsburgermonarchie">habsburgischen Erblanden</a>.</li>
  </ul>
  [...]
  </h4>
  <p><i>Folgende Staaten erhalten am 1. Januar ihre Unabhängigkeit:</i>
  </p>
  <ul>
    [...]
  </ul>
  <h3><span class="mw-headline" id="Wirtschaft">Wirtschaft</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=1._Januar&amp;veaction=edit&amp;section=6" class="mw-editsection-visualeditor" title="Abschnitt bearbeiten: Wirtschaft">Bearbeiten</a>
    <span class="mw-editsection-divider"> | </span><a href="/w/index.php?title=1._Januar&amp;action=edit&amp;section=6" title="Abschnitt bearbeiten: Wirtschaft">Quelltext bearbeiten</a><span class="mw-editsection-bracket">]</span></span>
  </h3>
  <h4><span class="mw-headline" id="Wichtige_Ereignisse_in_der_Weltwirtschaft">Wichtige Ereignisse in der Weltwirtschaft</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=1._Januar&amp;veaction=edit&amp;section=7"
      class="mw-editsection-visualeditor" title="Abschnitt bearbeiten: Wichtige Ereignisse in der Weltwirtschaft">Bearbeiten</a><span class="mw-editsection-divider"> | </span><a href="/w/index.php?title=1._Januar&amp;action=edit&amp;section=7" title="Abschnitt bearbeiten: Wichtige Ereignisse in der Weltwirtschaft">Quelltext bearbeiten</a>
    <span class="mw-editsection-bracket">]</span>
    </span>
  </h4>
  <ul>
    <li><a href="/wiki/1780" title="1780">1780</a>: In <a href="/wiki/Geschichte_Bratislavas" title="Geschichte Bratislavas">Preßburg</a> erscheint die erste ungarische Zeitung <i>Magyar hírmondó</i> („Ungarischer Kurier“).</li>

So far, I only managed to access all the li-elements (more than 1000!) that are not part of the table of contents with the following code:

Experimental Code Example:


Sub HistoricalEvents_Test()
    Dim http   As Object, html As New MSHTML.HTMLDocument
    Dim oLiList As MSHTML.IHTMLDOMChildrenCollection
    Dim data   As String
    Dim r      As Integer
    Dim oWord  As Object, oWordDoc As Object
    Dim wordApp As New Word.Application
    Dim iFirstRow As Integer, iLastRow As Integer

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
   
      
    Dim lLiResultList As Long
    Dim lLiResultLoop As Long
    
    Set oLiList = html.querySelectorAll("#toc ~ ul li")
      
    For lLiResultLoop = 0 To oLiList.Length - 1
        Dim oLiChild As Object
        Set oLiChild = oIlList.Item(lLilResultLoop)
            data = oLiChild.innerText   'data = data & vbCrLf & oLiChild.innerText
            Range("B" & lLiResultLoop +1).Value = data
            data = vbNullString
    Next lLiResultLoop
    
    
    Dim j      As Long
    Dim Ws As Worksheet
    Dim rngDB As Range
    Set Ws = ActiveSheet
    Set oWord = CreateObject("Word.Application")

    Set oWordDoc = oWord.Documents.Open("D:\Jahrestage Geschichte.docx")
    iFirstRow = 1             ' "Ws.Cells(1, 2).End(xlDown).Row" used to work fine but suddenly gives same as iLastRow!
    'Debug.Print iFirstRow
    iLastRow = Ws.Cells(ActiveSheet.Rows.Count, "B").End(xlUp).Row
    'Debug.Print iLastRow
    oWord.Visible = True
    
    With wordApp
        With Ws
            Set rngDB = Ws.Range(.Cells(iFirstRow, 2), .Cells(iLastRow, 2))
        End With
        rngDB.Cut
            oWord.Selection.PasteSpecial DataType:=wdPasteText
            oWord.Selection.TypeParagraph
            oWord.Selection = ""
    End With
    
    oWordDoc.Close savechanges:=True
    wordApp.Quit                                  'it doesn't :(
    
End Sub

Description of General Idea/Final Project

The final project is supposed to have a worksheet for every month, each containing a table with a row for every day of the respective month and columns for the different categories according to the (sub-)titles. The Word-output in the code is just an early-stage by-product and something I will round off only if/when the main problem can be solved.

Further Remarks

This is my first contribution on SO. I'm an absolute beginner when it comes to vba and web-scraping (or any kind of coding, scripting or programming for that matter), but I kind of got sucked into it and spent the better part of my winter holiday just to figure out the above code. I wouldn't have been able to accomplish even that poor piece of scripting without the invaluable knowledge shared with noobs like me by the cracks of SO. I've tried out various approaches but I always got stuck at some point, VBA triggering runtime errors and often Excel crashing. In particular, I wasn't able to implement the nextSibling/previousSibling methods or the nodeName selector successfully which I figure might be a promising approach to the problem. So any help or hint would be greatly appreciated!

Working Solution:

Thanks to the feedback on my question I finally managed to figure out a solution that does the job, although maybe not in the most elegant way. The only remaining problem is that strangely the li-elements of the last column are duplicated. So if anyone has a clue how to deal with that ...

Sub SliceHtmlByHeaderTypes4()

    Dim http   As Object, html As MSHTML.HTMLDocument
    Dim sh     As Worksheet
    Set sh = ThisWorkbook.ActiveSheet
    
    Set http = CreateObject("MSXML2.XMLHTTP"): Set html = New MSHTML.HTMLDocument
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
    
    Dim hNodeList As Object
    Dim startPos As Long, endPos As Long
    Dim s      As Integer, e As Integer
    
    Set hNodeList = html.querySelectorAll("#toc ~ h2, #toc ~ h3, #toc ~ h4")
    Debug.Print hNodeList.Length
    
    Do While s < hNodeList.Length - 1
        http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
        http.send
        html.body.innerHTML = http.responseText
        Set hNodeList = html.querySelectorAll("#toc ~ h2, #toc ~ h3, #toc ~ h4")
        startPos = InStr(html.body.outerHTML, hNodeList.Item(s).outerHTML)
        endPos = InStr(html.body.outerHTML, hNodeList.Item(s + 1).outerHTML)
        
        If startPos > 0 And endPos > 0 And endPos > startPos Then
            Dim strS  As String
            strS = Mid$(html.body.outerHTML, startPos, endPos - startPos + 1)
        Else
            MsgBox "Problem slicing string"
            Stop
            Exit Sub
        End If
        
        Dim liList As Object
        
        html.body.innerHTML = strS
        
        Set liList = html.getElementsByTagName("li")
        
        If liList.Length > 0 Then
            Dim i      As Integer
            Dim liText As String
            Dim lc As Integer
            Dim liRange As Range
            lc = (Cells(2, Columns.Count).End(xlToLeft).Column) + 1
            Set liRange = sh.Range(Cells(2, lc), Cells(2, lc))
            
            For i = 0 To liList.Length - 1
                On Error Resume Next
                liText = liList.Item(i).innerText
                liRange.Value = liRange.Value & liText & vbNewLine
                liText = vbNullString
            Next i
            
            strS = vbNullString
            startPos = 0
            endPos = 0
            hNodeList = ""
            i = 0
        End If
        s = s + 1
    Loop
End Sub
slintezgeu
  • 33
  • 6
  • 1
    So what would desired output for the above look like? And where you say `(e.g.from 2nd h3 to 3rd h3 or from 3rd h3 to next h4)` - are you expecting the same code to work for both scenarios/additional scenarios? Say the section was *Politics and World Affairs* what would that end up looking like? I assume that it was you meant by selecting a section between header tags? – QHarr Feb 13 '21 at 23:20
  • 2
    The HTML document is nothing but a text in which elements are uniquely identified. Therefore you can search for one element using the INSTR function, then for the other, and the string between them comprises all elements between them. – Variatus Feb 14 '21 at 00:01
  • @QHarr I'm aware I should have been more specific. I don't expect the same code to work for all the different scenarios. Instead, what I am looking for is a _principle_ or a general _method_ of choosing elements between two tags, in this case header tags, whether they are of the same level or not. – slintezgeu Feb 14 '21 at 20:37
  • And the output format? – QHarr Feb 14 '21 at 20:38
  • @QHarr Sorry, I'm not sure what exactly you mean by "output format". If you mean something different than what I described under "Description of General Idea/Final Project" then please explain. I really _am_ new to this field ... – slintezgeu Feb 14 '21 at 20:46
  • You can use string methods to cut the html and various points, fix into html documents and parse; or use css selectors to filter (do you know python or R?) By output, I mean what would the end result in Excel/Word look like for a couple of entries from a section under consideration. How does what is on the webpage end up looking in Excel/Word? I did not understand this from your description. – QHarr Feb 14 '21 at 20:46
  • @QHarr I know neither python nor R, I'm afraid. By "string methods", do you mean something similar as Variatus mentioned above? I will definitvely have a look at this method, but it will take some time to get familiar with it, I never used it before. – slintezgeu Feb 14 '21 at 20:50
  • 1
    Yes. You find where the start is of header 1 and slice there and slice again at the start of the next header so you only have 1 section of html extracted. You may need to adjust html to be valid for parsing e.g. by ensuring you cut at the start of the header tag etc.... you pass that into .body.innerHTML of HTMLDocument and use the inbuilt HTML parser – QHarr Feb 14 '21 at 20:54
  • Also, couldn't your current loop be restricted to just qualifying nodes by instead starting with `html.querySelectorAll("[class*=toclevel]")` instead of `html.querySelectorAll("div.mw-parser-output ul")` - then you don't need to do the check of if class LIKE ..... during the loop . See https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors – QHarr Feb 14 '21 at 20:58
  • @QHarr Thanks for your explanation! This approach sounds very promising and I will definitively giv it a try. As to your previous question: The structure of the final project is to have worksheets for each month with a first row containing the section titles, e.g. _Politics and World Affairs_ in B1, _Economy_ in C1 etc. and the days of the month in column A. This means that a cell like e.g. B1 will contain the text of the 107 li-elements of that section separated by line breaks. Or less, if I use separate columns for the subsections, which would make a lot of sense ... – slintezgeu Feb 14 '21 at 21:39
  • @QHarr I'm a bit confused. If I don't get it wrong ```html.querySelectorAll("[class*=toclevel]")``` includes all the ```li``` elements with "toclevel" as their className. But these are the ones I want to _exclude_ , hence my If..._not_... loop. – slintezgeu Feb 14 '21 at 22:22
  • 1
    Sorry, I misread that line.My bad. Looks like you want `html.querySelectorAll("#toc ~ ul")` – QHarr Feb 14 '21 at 22:37
  • 1
    No problem! Your new ```html.querySelectorAll("#toc ~ ul")``` works perfectly btw. Thank you very much for that, this method using "~" is totally new to me. How handy! – slintezgeu Feb 14 '21 at 23:08

2 Answers2

1

Instead of using loops, you can just copy and paste the range at once.

Sub HistoricalEvents_Test()
    Dim http   As Object, html As New MSHTML.HTMLDocument
    Dim oUlList As MSHTML.IHTMLDOMChildrenCollection, oLiList As MSHTML.IHTMLDOMChildrenCollection
    Dim data   As String
    Dim r      As Integer
    Dim oWord  As Word.Application ' Object
    Dim oWordDoc As New Word.document ' Object
    Dim wordApp As New Word.Application
    Dim iFirstRow As Integer, iLastRow As Integer

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
    
    'this works pretty fast, but a little reformatting on the Word document (yet to be implemented) is needed:
      
    Dim lUlResultList As Long
    Dim lUlResultLoop As Long
    
    Set oUlList = html.querySelectorAll("div.mw-parser-output ul")
    
    lUlResultList = oUlList.Length
    
    For lUlResultLoop = 0 To oUlList.Length - 1
        Dim oUlChild As Object
        Set oUlChild = oUlList.Item(lUlResultLoop)
        If Not oUlChild.FirstChild.className Like "*toclevel*" Then
            data = oUlChild.innerText             'data = data & vbCrLf & oUlChild.innerText
            Range("B" & lUlResultLoop).Value = data
            data = vbNullString
        End If
    Next lUlResultLoop
    
    
    'this works as well, no reformatting needed, but pasting to Word is much slower:
    
    '    Dim lLiResultList As Long
    '    Dim lLiResultLoop As Long
    
    '    Set oLiList = html.querySelectorAll("div.mw-parser-output ul li")
    
    '    lLiResultList = oLiList.Length
    
    '    For lLiResultLoop = 0 To oLiList.Length - 1
    '        Dim oLiChild As Object
    '        Set oLiChild = oLiList.Item(lLiResultLoop)
    '        If Not oLiChild.className Like "*toclevel*" Then
    '            data = oLiChild.innerText           'data = data & vbCrLf & oLiChild.innerText
    '            Range("B" & lLiResultLoop).Value = data
    '            data = vbNullString
    '        End If
    '    Next lLiResultLoop
    
    '********************************************************************************************
    
    Dim j      As Long
    Dim Ws As Worksheet
    Dim rngDB As Range
    Set Ws = ActiveSheet
    Set oWord = CreateObject("Word.Application")

    Set oWordDoc = oWord.Documents.Open("D:\Jahrestage Geschichte.docx")
    'Set oWordDoc = oWord.Documents.Add
    iFirstRow = Ws.Cells(1, 2).End(xlDown).Row
    'Debug.Print iFirstRow
    iLastRow = Ws.Cells(ActiveSheet.Rows.Count, "B").End(xlUp).Row
    'Debug.Print iLastRow
    oWord.Visible = True
    
    With wordApp
        With Ws
            Set rngDB = Ws.Range(.Cells(iFirstRow, 2), .Cells(iLastRow, 2))
        End With
        rngDB.Cut
            oWord.Selection.PasteSpecial DataType:=wdPasteText
            oWord.Selection.TypeParagraph
'        For r = iFirstRow To iLastRow
'            Range(Cells(r, 2), Cells(r, 2)).Cut
'            oWord.Selection.PasteSpecial DataType:=wdPasteText
'            oWord.Selection.TypeParagraph
'            For j = 1 To 4
'                Dim t  As Double
'                t = Timer
'                Do Until Timer - t >= 0.4         'can't go faster or error 4605 occurs!
'                    DoEvents
'                Loop
'            Next
            oWord.Selection.TypeParagraph
            oWord.Selection = ""
        'Next r
    End With
    
    oWordDoc.Close savechanges:=True
    wordApp.Quit                                  'it doesn't :(
    
End Sub
Dy.Lee
  • 7,527
  • 1
  • 12
  • 14
  • Amazing! Your solution works perfectly and at lightning speed, too!! Thank you very much! – slintezgeu Feb 14 '21 at 21:19
  • Not sure if my accepting this answer related to the Word-export aspect means that my _main_ question about the selection of elements beween two tags is regarded as answered as well. In fact, it isn't quite yet but I have received a lot of valuable input to continue developing my project with. So thank you all for your feedback! – slintezgeu Feb 15 '21 at 00:49
1

Here is an example of what I meant by narrowing down HTML to between specific headers (positional and based on header type e.g. h2 headers, first and second). Css selectors are used to ensure h2 list includes only the desired h2 elements list to then select positionally from i.e. pick the first and second items of; .outerHTML is used to ensure I slice at the start of element's html preserving tag structure for inserting back into HTML.body.innerHTML. I then query the HTMLDocument with subset html content for the li elements.

You could can hopefully see how the principal could be expanded. In order to mix header types (being cautious of css selectivity and ordering) you can use OR syntax to retrieve mixed header nodeLists at start e.g. h2 and h3 would be html.querySelectorAll("#toc ~ h2, #toc ~ h3"). Be very cautious and verify that the returned nodeList is ordered as expected when using this latter approach.


Option Explicit

Public Sub SliceHtmlByHeaderTypes()
    
    Dim http As Object, html As MSHTML.HTMLDocument
    
    Set http = CreateObject("MSXML2.XMLHTTP"): Set html = New MSHTML.HTMLDocument
    http.Open "GET", "https://de.wikipedia.org/wiki/1._Januar", False
    http.send
    html.body.innerHTML = http.responseText
    
    Dim hNodeList As Object
    
    Set hNodeList = html.querySelectorAll("#toc ~ h2")
    
    Dim startPos As Long, endPos As Long
    
    startPos = InStr(html.body.outerHTML, hNodeList.Item(0).outerHTML) ' we wanna split between 1st and 2nd i.e. indices 0 and 1
    endPos = InStr(html.body.outerHTML, hNodeList.Item(1).outerHTML)
    
    Debug.Print hNodeList.Item(0).innerText
    Debug.Print hNodeList.Item(1).innerText
    
    If startPos > 0 And endPos > 0 And endPos > startPos Then
        Dim s As String
        s = Mid$(html.body.outerHTML, startPos, endPos - startPos + 1)
    Else
       MsgBox "Problem slicing string"
       Exit Sub
    End If
    
    Dim liList As MSHTML.IHTMLElementCollection
    
    html.body.innerHTML = s  'replace html content with new spliced content
    
    Set liList = html.getElementsByTagName("li") 'then do something with list of lis
    
    Stop

End Sub

Bear in mind there are a host of wiki apis out there that may serve the content you want e.g.

https://stackoverflow.com/questions/627594/is-there-a-wikipedia-api

QHarr
  • 83,427
  • 12
  • 54
  • 101
  • 1
    This is extremely helpful! Exactly the kind of example I was looking for and explained in a way that even a beginner like me can understand. I will continue with my project based on this input. Thank you very much! Btw, I had to change the declaration of ```liList``` to "Object", otherwise a runtime error 13 would be triggered. It doesn't seem to affect the functionality, though. – slintezgeu Feb 15 '21 at 14:25
  • For some reason I can't explain ```querySelectorAll``` does no longer return any matches and ```hNodeList.Length``` is always Null. I haven't made any changes to the code whatsoever so I really don't have a clue why it suddenly stopped working. The html of the website seems unchanged to me, too. Any idea what the problem might be? – slintezgeu Feb 23 '21 at 09:25
  • I cannot reproduce this as works for me. Are you running the above as is? – QHarr Feb 23 '21 at 17:32
  • Yes, 100% as is, that's why I'm at a complete loss. – slintezgeu Feb 23 '21 at 20:24
  • 1
    Do you get any matches with `html.getElementsByClassName("mw-parser-output")(0).getElementsByTagName("h2")` ? – QHarr Feb 23 '21 at 20:56
  • Yes, that works. ```hNodeList.Length``` is 6 and ```liList``` contains 40 items. – slintezgeu Feb 23 '21 at 21:15
  • Then use that but skip the first retrieved item as it is a parent header. Or don't if liList is fine. – QHarr Feb 23 '21 at 21:22
  • 1
    Ok, I will use that then until I figure out why the ```querySelectorAll``` method has stopped working all of a sudden. Thanks! – slintezgeu Feb 23 '21 at 21:29
  • @slintezgeu See https://stackoverflow.com/questions/67596872/how-to-hold-a-reference-to-the-items-matched-by-queryselectorall-in-a-variabl – QHarr Apr 14 '22 at 02:33