2

So, I have this spreadsheet with a list of about 5000 URLs. (All pages on our corporate intranet.)

We know some of the links are broken, but don't know of a good way to determine which without clicking all 5000 links.

Normally this would be a simple matter: Create a web page with links to the 5000 pages, and then check the links with a tool like Xenu Link Sleuth.

But that won't work in this case because many of the links are being redirected, and the redirect code spoofs HTTP.response 200, which tricks Xenu into treating it as a valid URL.

However, there is some good news: The redirect script does not run from within Excel. If you click a bad link inside Excel, the redirect script does not execute and the HTTP response is reported back to Excel. I believe Excel should be able to identify the correct HTTP response code (404) - or at least whether the link was valid or not.

Which brings me to my question:

Is there a way using VBA to write a script that would click through every link and capture the result? The result captured could be in the form of the HTTP response code or anything else you think would be useful in finding the bad links in this list of 5000 pages. Ideally the result would be written to a cell in the spreadsheet adjacent to the link.

If anyone if familiar enough with VBA to suggest a solution to this problem, I would be eternally grateful!

John Cherry
  • 189
  • 1
  • 2
  • 10

2 Answers2

4

Here is an example to check the status line from a list of URL with Excel:

Sub TestLinks()
  Dim source As Range, req As Object, url$
  Set req = CreateObject("Msxml2.ServerXMLHTTP.6.0")

  ' define were the links and results are
  Set source = Range("A1:B2")

  ' clear the results
  source.Columns(2).Clear

  ' iterate each row
  For i = 1 To source.Rows.count
    ' get the link from the first column
    url = source.Cells(i, 1)

    ' send the request using a HEAD to check the status line
    req.Open "HEAD", url, False
    req.setRequestHeader "Accept", "image/webp,image/*,*/*;q=0.8"
    req.setRequestHeader "Accept-Language", "en-GB,en-US;q=0.8,en;q=0.6"
    req.setRequestHeader "Accept-Encoding", "gzip, deflate"
    req.setRequestHeader "Cache-Control", "no-cache"
    req.setRequestHeader "Content-Type", "text/xml; charset=utf-8"
    req.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
    req.Send

    ' write the result in the second column
    source.Cells(i, 2) = req.Status
  Next

  MsgBox "Finished!"
End Sub
Florent B.
  • 41,537
  • 7
  • 86
  • 101
  • 1
    Using Xmlhttp is possible, too. For me, I found using WinHttpRequest to be a bit more robust. – Sebastian B Mar 17 '16 at 15:43
  • Thanks very much, @florentbr! I'm still learning my way around the Visual Basic window in Excel. I'm going to try to get @bioschaf's script working first and then will try to get yours to work. I do appreciate the help! – John Cherry Mar 17 '16 at 16:39
3

Use a user defined function to return HTML-Status Codes and drag it down next to the links. Might take a while for Excel to check 5000 links, though.

Public Function CheckURL(url As String) As String
Dim request As New WinHttpRequest
request.Open "GET", url
request.Send
CheckURL = request.Status
End Function

You will probably need to add a reference to "Microsoft WinHTTP Services" under "Extras" -> "References"

Sebastian B
  • 441
  • 2
  • 8
  • Hi bioschaf, thank you very much for the help! I've hardly ever touched Excel VBA. Would you mind giving me a little guidance? In my spreadsheet I have enabled the developer tab, and then opened the visual basic window. From there, I pasted your code into the code area. Then back in the spreadsheet, I attempted to execute the function using =CheckURL(A2). But this results in an error. Any idea what I'm doing wrong? (I did enable Microsoft WinHTTP Services" as suggested.) – John Cherry Mar 17 '16 at 16:35
  • Hi, you need to insert the function in a module to make it work. So in the VBA-windows, right click in your project, choose "Insert -> Module" and paste the code there. Then you should be able to use the function within excel-cells. – Sebastian B Mar 17 '16 at 16:39
  • Awesome, thank you very much. That worked. I tested on 3 rows and got results - now I'm copying down 5000 rows and Excel is "busy." :-) – John Cherry Mar 17 '16 at 16:50
  • Yes, it will take a lot of time. Maybe it would be a good idea to test with some hyperlinks that should return 404 - just to make sure the macro produces the correct results. – Sebastian B Mar 17 '16 at 16:52
  • Good thinking. I've been playing with it; the results are mixed so far. I tested with some invalid URLs but got 401 instead of 404. I tested with some redirected URLs as well, and these also returned 401. And most baffling, I tested a couple of VALID URLs and they also returned 401. But not everything is returning 401. There are a handful of 200's, a few 403's and 404's, some 503's, and some that show #VALUE! All of the #VALUE!'s correspond to pages on sites that no longer exist. – John Cherry Mar 17 '16 at 17:14
  • Based on Fiddler results, the results showing #VALUE! in Excel are HTTP 502 Bad Gateway. – John Cherry Mar 17 '16 at 17:19
  • Sorry to hear that it does not work as expected. When I tested it myself, I found that because we have a proxy, it would produce strange results for external sites. Intranet worked as expected. – Sebastian B Mar 17 '16 at 17:23
  • Oh, no worries, I greatly appreciate your help. This is producing some useful information but also some results that require more analysis. All the ones showing #VALUE can be correctly identified as bad links. I just realized (using Fiddler) that the reason the 404s are showing 401 is because our servers are configured to make all HTTP requests anonymous initially. In the browser, the request is immediately resubmitted with credentials so the user never noticed an error, but in Excel we're just getting the initial 401 response and not the actual results from the re-request with credentials. – John Cherry Mar 17 '16 at 17:28
  • Also, even beyond the help with the immediate problem I appreciate your helping me learn a little bit about VBA in Excel! – John Cherry Mar 17 '16 at 17:28
  • You are welcome. Maybe you could try using a browser object (or the XML server from the other answer) to get more meaningful results, but I am not sure to what server codes it might lead if those methods would resubmit the original request. – Sebastian B Mar 17 '16 at 17:34
  • Using a `HEAD` request instead of `GET` will speed things up. You don't need the contents of the URL, just whether it exists. – Paul Ogilvie Nov 29 '17 at 09:53