I have a directory containing many HTML files, and many of those files link to other files. In some cases, I write an href before I have created the file it points to. In other cases, the href points to an external web site, and the owner of that site may later remove or rename the file.
I am looking for a program that will go through a set of files, build a list of hrefs for each file, and then, for each href, verify that it does not return a 404. I could write such a program in Python using Beautiful Soup, but I want to ask whether such a program already exists. I cannot believe that nobody has written one; I can believe that I am an incompetent searcher.
I have tried https://app.hreflang.org/. That's the wrong tool for this problem.
"Fetch all href link using selenium in python" assumes that my content is already on a website; I want to test before I publish. I found "How to save all files associated with a website?", but that reference itself is broken (which is a little meta, but I digress). From what I read in its documentation, Scrapy also expects pages to be served from a web server, and I think Octoparse has the same limitation.
This should be easy to do. I start with any HTML page in the directory and explore it using Python and Beautiful Soup. I add that page's path as a key in a dictionary, and all of the hrefs on that page go into a list that is the value for that key. Then I iterate through that list; if a page in the list is already a key in the dictionary, it has been scanned and can be ignored. At the end, any HTML file in the directory that does not appear in the dictionary has no references to it. If there is a reference to a local file and that file does not exist, the link is broken. If a reference to an external page returns a 4xx error, that link is also broken. 5xx errors are a different problem, and 301 and 302 responses should be followed as appropriate. This should be easy, which is why I am convinced that I am incompetent at searching.
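In case it helps to show what I mean, here is a rough, untested sketch of the kind of script I would write myself (it assumes requests and Beautiful Soup are installed; the directory name, function names, and the 10-second timeout are my own placeholders):

```python
import os
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

ROOT = "site"  # placeholder: the directory holding my HTML files


def hrefs_in(path):
    """Return all href values found in one local HTML file."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def check_links(root):
    seen = {}      # local file path -> list of hrefs found in it
    broken = []    # (source file, href, reason)

    html_files = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
        if name.endswith(".html")
    ]

    for page in html_files:
        seen[page] = hrefs_in(page)
        for href in seen[page]:
            parsed = urlparse(href)
            if parsed.scheme in ("http", "https"):
                # External link: a 4xx status means broken; 301/302 are followed.
                # (Some servers reject HEAD; a GET fallback might be needed.)
                try:
                    status = requests.head(
                        href, allow_redirects=True, timeout=10
                    ).status_code
                except requests.RequestException as exc:
                    broken.append((page, href, str(exc)))
                    continue
                if 400 <= status < 500:
                    broken.append((page, href, f"HTTP {status}"))
            elif not parsed.scheme and parsed.path:
                # Local link: broken if the target file does not exist yet.
                target = os.path.normpath(
                    os.path.join(os.path.dirname(page), parsed.path)
                )
                if not os.path.exists(target):
                    broken.append((page, href, "missing file"))

    # Files never referenced by any local href (no references to them).
    referenced = {
        os.path.normpath(os.path.join(os.path.dirname(p), urlparse(h).path))
        for p, links in seen.items()
        for h in links
        if not urlparse(h).scheme and urlparse(h).path
    }
    orphans = [p for p in html_files if os.path.normpath(p) not in referenced]
    return broken, orphans


if __name__ == "__main__":
    broken, orphans = check_links(ROOT)
    for source, href, reason in broken:
        print(f"{source}: {href} ({reason})")
    for page in orphans:
        print(f"never linked to: {page}")
```

requests.head keeps the external checks cheap, and allow_redirects=True makes 301/302 transparent, which matches what I said above. I would rather not maintain this myself, though, which is why I am asking whether an existing tool already does it on local files rather than on a live site.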