Web scraping across multiple pages without knowing the last page number

Question

My code crawls the titles of different tutorials spread across several pages of a site, and it works flawlessly. I wrote it so that it does not depend on knowing the last page number in the URL; instead it keeps requesting pages until the HTTP status is no longer 200. The code pasted below works as intended for this site. However, when I tried another URL to see whether the loop would stop on its own, the code fetched all the results but never broke out of the loop. What is the workaround so that the code breaks when it is done and the macro stops? Here is the working version:

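' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library"
Sub WiseOwl()
    Const mlink = "http://www.wiseowl.co.uk/videos/default"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object
    Dim x As Long, y As Long    ' x = output row, y = page number

    Do While True
        y = y + 1
        With http
            ' Pages are named default-1.htm, default-2.htm, ...
            .Open "GET", mlink & "-" & y & ".htm", False
            .send
            ' This site answers with a non-200 status past the last page
            If .Status <> 200 Then
                MsgBox "It's done"
                Exit Sub
            End If
            html.body.innerHTML = .responseText
        End With

        ' Write the link text of each series title into column A
        For Each post In html.getElementsByClassName("woVideoListDefaultSeriesTitle")
            With post.getElementsByTagName("a")
                x = x + 1
                If .Length Then Cells(x, 1) = .Item(0).innerText
            End With
        Next post
    Loop
End Sub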

I found a way to handle the Yellow Pages site. My updated script can parse it, but it stops before scraping the last page because that page has no "Next Page" button. This is what I tried: "https://www.dropbox.com/s/iptqm79b0byw3dz/Yellowpage.txt?dl=0"

However, when I applied the same logic to a torrent site, it did not work:

"https://www.yify-torrent.org/genres/western/p-1/"

The Yellow Pages site just returns the last available page if you request, e.g., "...page=999". It still returns 200 as the status. — Tim Williams, Jul 19 '17 at 20:24
You might just have to match something in the page as well to make sure you still have results. For general redirections, you can try [disabling redirection for your HTTP request](https://stackoverflow.com/questions/161343/how-do-i-prevent-serverxmlhttp-from-automatically-following-redirects-http-303) and catch the HTTP status code to determine if it's trying to redirect you to a page you didn't intend on going to. — Hao Zhang, Jul 19 '17 at 20:29
Thanks for your answer, Hao Zhang. It seems to be working. I'll get back to you after checking a few more URLs. Thanks. — SIM, Jul 19 '17 at 20:43
Sorry Hao Zhang, it didn't do the trick. I tried it with Yellow Pages but it kept running without ever stopping. — SIM, Jul 19 '17 at 20:56
I noticed that every result entry has the class name "result". Now you just need to find an alternative to getElementsByClassName, since it's apparently not available. Of course, there's no way to determine this in general for every site you run across, because each one behaves differently when it runs out of results. What I just told you only works for that Yellow Pages site. — Hao Zhang, Jul 19 '17 at 21:05
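
The two ideas from the comments above (check the page content, and refuse to follow redirects) can be combined in one loop. The following is only a rough sketch under stated assumptions, not a tested solution for any particular site. `BASE_URL` is a made-up placeholder, `MAX_PAGES` and `PastLastPage` are hypothetical names introduced here, and "result" is simply the class name mentioned in the comment above. It swaps `XMLHTTP60` for `WinHttp.WinHttpRequest.5.1`, whose option 6 (`WinHttpRequestOption_EnableRedirects`) can be switched off, and it assumes `getElementsByClassName` is available, as it is in the question's own code.

```vba
Option Explicit

' Hypothetical placeholder: substitute the real listing URL up to the page number.
Const BASE_URL As String = "https://example.com/listing?page="
' Class name taken from the comment about the Yellow Pages site.
Const RESULT_CLASS As String = "result"
' Safety cap (assumption) so a site that never signals the end cannot loop forever.
Const MAX_PAGES As Long = 500

' Pure decision helper: True when the page just fetched shows we ran out of results.
Function PastLastPage(ByVal status As Long, ByVal resultCount As Long, _
                      ByVal firstText As String, ByVal prevFirstText As String) As Boolean
    If status <> 200 Then
        PastLastPage = True     ' error status, or a redirect we refused to follow
    ElseIf resultCount = 0 Then
        PastLastPage = True     ' page loaded but holds no result entries
    ElseIf Len(prevFirstText) > 0 And firstText = prevFirstText Then
        PastLastPage = True     ' server served the same (last) page again
    End If
End Function

Sub ScrapeUntilResultsRunOut()
    Dim req As Object               ' late-bound WinHttpRequest
    Dim html As New HTMLDocument    ' needs the Microsoft HTML Object Library reference
    Dim posts As Object
    Dim page As Long, r As Long, i As Long, count As Long
    Dim firstText As String, prevFirstText As String

    Set req = CreateObject("WinHttp.WinHttpRequest.5.1")

    For page = 1 To MAX_PAGES
        req.Open "GET", BASE_URL & page, False
        req.Option(6) = False       ' 6 = WinHttpRequestOption_EnableRedirects
        req.Send

        count = 0
        firstText = vbNullString
        If req.Status = 200 Then
            html.body.innerHTML = req.ResponseText
            Set posts = html.getElementsByClassName(RESULT_CLASS)
            count = posts.Length
            If count > 0 Then firstText = posts.Item(0).innerText
        End If

        If PastLastPage(req.Status, count, firstText, prevFirstText) Then Exit For

        ' Write every result's text into column A of the active sheet
        For i = 0 To count - 1
            r = r + 1
            Cells(r, 1) = posts.Item(i).innerText
        Next i
        prevFirstText = firstText
    Next page

    Debug.Print "Stopped at page " & page
End Sub
```

Comparing the first entry with the previous page's first entry is what catches a server that keeps answering 200 with its last page. If a site legitimately redirects every page (for example from http to https), turning redirects off would stop the loop on page 1, so that line would have to go.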

score 1 · Accepted Answer · answered Jul 20 '17 at 21:13

You can always rely on whether the elements exist or not. Here, for example, if you try to use the object you have set your element to, you will get:

Run-time error '91': Object variable or With block variable not set

This error is the signal you should look for to put an end to your code. Please see the example below:

Sub yify()
    Const mlink = "https://www.yify-torrent.org/genres/western/p-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object
    Dim posts As Object
    Dim x As Long, y As Long    ' x = output row, y = page number

    y = 1
    Do
        With http
            .Open "GET", mlink & y & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")
        ' From here on, any run-time error jumps to Endofpage
        On Error GoTo Endofpage
        Debug.Print Len(posts) 'to force Error 91

        ' Write the text of the first div of each entry into column A
        For Each post In posts
            With post.getElementsByTagName("div")
                x = x + 1
                If .Length Then Cells(x, 1) = .Item(0).innerText
            End With
        Next post
        y = y + 1
Endofpage:
    ' Error 91 means there was nothing left to read, so the loop ends
    Loop Until Err.Number = 91
    Debug.Print "It's over"
End Sub
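
For comparison, the same stop condition can be written as an explicit check instead of a trapped error. This is only a sketch and not the approach used in the answer above. It assumes that a page past the end yields either no collection at all or an empty one for the "mv" class, which has not been verified against the site. `HasItems` and `YifyStopOnEmptyPage` are hypothetical names introduced for the illustration.

```vba
' Hypothetical helper: True only when coll is set and holds at least one element.
Function HasItems(ByVal coll As Object) As Boolean
    If coll Is Nothing Then Exit Function
    HasItems = (coll.Length > 0)
End Function

Sub YifyStopOnEmptyPage()
    Const mlink As String = "https://www.yify-torrent.org/genres/western/p-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim posts As Object, post As Object
    Dim x As Long, y As Long    ' x = output row, y = page number

    y = 1
    Do
        With http
            .Open "GET", mlink & y & "/", False
            .send
            If .Status <> 200 Then Exit Do
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")
        ' No "mv" entries: assume we are past the last page
        If Not HasItems(posts) Then Exit Do

        For Each post In posts
            With post.getElementsByTagName("div")
                x = x + 1
                If .Length Then Cells(x, 1) = .Item(0).innerText
            End With
        Next post
        y = y + 1
    Loop
    Debug.Print "It's over"
End Sub
```

The design difference is that no error handler stays active, so an unrelated run-time error inside the loop surfaces normally instead of being routed to the end-of-page label.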

You are impossible, Tehscript!!!! I ran it against two problematic sites and it worked on both, as your code always does. It's Oscar-winning code. I have never seen this style before. Thanks for everything. Long live!!!! — SIM, Jul 20 '17 at 21:37
One last request: please take a look at this link. I'm a newbie at writing classes, so I can't fix my mistakes. In your spare time, please: "https://stackoverflow.com/questions/45224058/class-crawler-written-in-python-throws-attribute-error". Stay well. — SIM, Jul 20 '17 at 21:43
