I am writing a quick (hopefully) C# app that needs to crawl through a package on my local hard drive, open every HTML file in the tree, and verify that every link within those files points to a valid target. I can think of a bunch of ways of doing this, from low-level grep-ing of hrefs and dir/file scanning to opening a web browser and catching 404 exceptions. My question is more a matter of efficiency, as this has to happen across a ton of files. What method, for local files only, should I look into using?
- http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Luizgrs Nov 13 '14 at 19:10
- possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – Luizgrs Nov 13 '14 at 19:11
- Love the first post - can't believe I missed it when I was searching :) – Michael Dorgan Nov 13 '14 at 19:12
- HtmlAgilityPack looks... interesting. But no examples, no docs that work, and the examples on his page don't even compile. – Michael Dorgan Nov 13 '14 at 19:43
3 Answers
Don't grep, that's error-prone. Don't open a web browser, that's hacky and slow.
I would just parse the HTML with some existing library, extract all the hrefs, convert them to file paths, and check the existence of the files with System.IO.File.Exists.
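A minimal sketch of that shape, assuming the hrefs are plain relative paths; ExtractHrefs is just a placeholder for whichever parsing library you end up using (the last answer below shows HtmlAgilityPack):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class LinkChecker
{
    static void Main(string[] args)
    {
        string root = args[0];

        // Walk every .html file under the root of the package.
        foreach (string htmlFile in Directory.EnumerateFiles(root, "*.html", SearchOption.AllDirectories))
        {
            foreach (string href in ExtractHrefs(htmlFile))
            {
                // Resolve the href relative to the file that contains it.
                string target = Path.GetFullPath(
                    Path.Combine(Path.GetDirectoryName(htmlFile), href));

                if (!File.Exists(target))
                    Console.WriteLine("{0}: broken link -> {1}", htmlFile, href);
            }
        }
    }

    // Placeholder: return the href values found in one file, using whatever parser you pick.
    static IEnumerable<string> ExtractHrefs(string htmlFile)
    {
        throw new NotImplementedException();
    }
}
```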

My guess is that this is a project somewhat under your control. In that case, any errors you find you plan on fixing or having someone fix. Also, if you are looking to crawl through files and feel like that can give you some benefit, this is all or mostly static HTML. If all these assumptions are true, at the risk of raising the ire of those in the other questions who say you can't "parse HTML" with regex, I actually do recommend using a regex. IMHO, you are looking for either href="url" or src="url". That shouldn't be particularly error-prone. There is a chance you could miss something, but you don't NEED to parse the entire HTML DOM just to find those two relatively simple patterns.
That being said, if I were doing this I would loop through Regex.Matches, use Path.Combine to merge each relative path with the root folder, and use File.Exists like Sebastian recommends. For absolute URLs that are external, I would use HttpWebRequest. In addition, I would queue up all of the requests and get the responses asynchronously.
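Roughly what that could look like; the pattern and names here are illustrative rather than a drop-in implementation (anchor-only and mailto: links would need extra filtering), and the external checks could be batched with GetResponseAsync and Task.WhenAll as suggested above:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class RegexLinkChecker
{
    // Grabs the url out of href="url" or src="url"; good enough for static, well-formed pages.
    static readonly Regex LinkPattern =
        new Regex(@"(?:href|src)\s*=\s*""([^""]+)""", RegexOptions.IgnoreCase);

    static void CheckFile(string htmlFile, string rootFolder)
    {
        foreach (Match match in LinkPattern.Matches(File.ReadAllText(htmlFile)))
        {
            string url = match.Groups[1].Value;

            if (url.StartsWith("http://") || url.StartsWith("https://"))
            {
                // External link: a HEAD request is enough to see whether it resolves.
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.Method = "HEAD";
                try { request.GetResponse().Close(); }
                catch (WebException) { Console.WriteLine("Broken external link: " + url); }
            }
            else
            {
                // Local link: merge with the root folder and check the file is there.
                string path = Path.Combine(rootFolder, url.TrimStart('/'));
                if (!File.Exists(path))
                    Console.WriteLine("Broken local link in " + htmlFile + ": " + url);
            }
        }
    }
}
```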

- All files are absolutely local and locked behind blazing walls of fire. Scraping for filenames will work, but the refs also jump around on the pages themselves and it'd be nice if it handled that too. And the rabbit hole goes all the way down... This is why I came to ask. Such a simple-sounding request... :) – Michael Dorgan Nov 13 '14 at 19:53
Using HTMLAgilityPack:
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when the document contains no matching links.
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        if (System.IO.File.Exists(link.Attributes["href"].Value))
        {
            // your file exists
        }
    }
}
Most of the code above is from their own examples page. You might need some additional work on the href attribute.
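For example, stripping any #fragment or query string, undoing URL encoding, and resolving the path against the file that contains the link rather than the working directory. Something along these lines (the helper name is made up):

```csharp
using System;
using System.IO;

static class HrefHelper
{
    // Illustrative cleanup before the File.Exists check.
    public static string ToLocalPath(string htmlFilePath, string href)
    {
        // Drop fragment and query portions ("page.htm#section" -> "page.htm").
        int cut = href.IndexOfAny(new[] { '#', '?' });
        if (cut >= 0)
            href = href.Substring(0, cut);

        // Pure in-page anchors ("#top") end up empty here; treat them as the file itself.
        if (href.Length == 0)
            return htmlFilePath;

        // Undo %20 and friends, then resolve relative to the containing file's directory.
        href = Uri.UnescapeDataString(href);
        return Path.GetFullPath(Path.Combine(Path.GetDirectoryName(htmlFilePath), href));
    }
}
```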

- I'm not able to save working code on dotnetfiddle.net but here is another example https://dotnetfiddle.net/fnDPLB – Luizgrs Nov 14 '14 at 10:49