I have an html file with many <a> tags with href links.

I would like to have the page do nothing when these links point to an outside url (http://....) or an internal link that is broken.

The final goal is to have the html page used offline without having any broken links. Any thoughts?

I have tried using a Python script to change all links but it got very messy.

Currently I am trying to use JavaScript and calls such as $("a").click(function(event) {} to handle these clicks, but these have not been working offline.

Also, caching the pages will not be an option because they will never be opened online. In the long run, this may also need to be adapted to src attributes, and will be used in thousands of html files.

Lastly, it would be preferable to use only standard and built in libraries, as external libraries may not be accessible in the final solution.

UPDATE: This is what I have tried so far:

//Register link clicks
$("a").click(function(event) {
    checkLink(this, event);
});

//Checks to see if the clicked link is available
function checkLink(link, event){

    //Is this an outside link?
    var outside = (link.href).indexOf("http") >= 0 || (link.href).indexOf("https") >= 0;

    //Is this an internal link?
    if (!outside) {
        if (isInside(link.href)){
            console.log("GOOD INSIDE LINK CLICKED: " + link.href);
            return true;
        }
        else{
            console.log("BROKEN INSIDE LINK CLICKED: " + link.href);
            event.preventDefault();
            return false;
        }
    }
    else {
        //This is outside, so stop the event
        console.log("OUTSIDE LINK CLICKED: " + link.href);
        event.preventDefault();
        return false;
    }
}

//DOESNT WORK
function isInside(link){
    $.ajax({
        url: link, //or your url
        success: function(data){
            return true;
        },
        error: function(data){
            return false;
        },
    })
}

Also an example:

<a href="http://google.com">Outside Link</a>             : Do Nothing ('#')
<a href="https://google.com">Outside Link</a>            : Do Nothing ('#')
<a href="/my/file.html">Existing Inside Link</a>         : Follow Link
<a href="/my/otherfile.html">Inexistent Inside Link</a>  : Do Nothing ('#')
vontell
  • If you post some of your attempts we may be able to guide you better – junnytony Jun 12 '15 at 01:38
  • Your `isInside()` function won't work the way you want because the `ajax` call is asynchronous and the function returns right after issuing the `ajax` call. Are you opposed to pre-parsing the file in python, replacing dead links with `href=#` and using that? – junnytony Jun 12 '15 at 02:00
  • I am not entirely opposed to pre-parsing the files, except in the overall scheme of things, the entire project I am working with has 150,000+ links and thousands of html files. My concern with ajax and jquery is also whether they will work offline. – vontell Jun 12 '15 at 02:04
  • ah.. I see... If these files are not dynamically generated, preparsing and caching the html files may be a good option. jQuery runs in your browser and as long as you have a webserver running "inside", AJAX calls should work as well. BTW, by "inside link" do you mean accessing a file on a local webserver? – junnytony Jun 12 '15 at 02:13
  • Yes by 'inside link' I mean links accessible on this internal, local server. Is there a way to cache these files without an initial online version and with offline devices? – vontell Jun 12 '15 at 02:19

2 Answers

Here is some JavaScript that will prevent navigation to external sites:

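var anchors = document.getElementsByTagName('a');
for (var i = 0, ii = anchors.length; i < ii; i++) {
    anchors[i].addEventListener('click', function (evt) {
        // Read the raw attribute: this.href is always resolved to an absolute
        // URL, so it also starts with "http" for internal links served over HTTP.
        var href = this.getAttribute('href') || '';
        if (href.slice(0, 4) === "http") {
            evt.preventDefault();
        }
    });
}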
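
A prefix test on "http" does not catch protocol-relative links such as //example.com/page. One possible alternative, sketched below, is to resolve the href the way the browser would and compare origins. The helper name `isExternalLink` is hypothetical, and the sketch assumes the pages are served from a local web server rather than opened from file://.

```javascript
// Hypothetical helper: returns true when rawHref points at a different origin
// than the page it appears on.
function isExternalLink(rawHref, pageUrl) {
    var target;
    try {
        // Resolve relative hrefs against the page URL, as the browser would.
        target = new URL(rawHref, pageUrl);
    } catch (e) {
        // Unparseable hrefs are treated as external so the click gets blocked.
        return true;
    }
    // Same scheme, host and port means the link stays on the local server.
    return target.origin !== new URL(pageUrl).origin;
}
```

Inside the click handler this could be called as `isExternalLink(this.getAttribute('href'), window.location.href)`, with `evt.preventDefault()` when it returns true.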

EDIT: To check on the client side whether a local path is good, you would have to send an ajax request and then check the status code of the response (the infamous 404). However, you can't do ajax from a static html file opened directly (e.g. file://index.html). The page would need to be running on some kind of local server.

Here is another Stack Overflow question that talks about that issue.

Sam Eaton

Javascript based solution:

If you want to use javascript, you can fix your isInside() function by making the $.ajax() call synchronous (async: false). That will cause it to wait for a response before returning. See the jQuery.ajax documentation. Pay attention to the warning that synchronous requests may temporarily lock the browser, disabling any actions while the request is active (this may be good in your case).

Also, instead of doing a 'GET', which is what $.ajax() does by default, your request should be a 'HEAD' (assuming your internal webserver hasn't disabled responding to this HTTP verb). 'HEAD' is like 'GET' except that it doesn't return the body of the response, so it's a good way to find out whether a resource exists on a web server without having to download the entire resource.

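// Formerly isInside. Renamed it to reflect its function.
function isWorking(link){
    // A return inside a callback does not return from isWorking itself,
    // so the callbacks record the outcome in this variable instead.
    var working = false;
    $.ajax({
        url: link,
        type: 'HEAD',
        async: false,
        success: function(){ working = true; },
        error: function(){ working = false; }
    });
    // With async: false, the callbacks have already run by the time we get here.
    return working;
}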
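
If locking the browser is a concern, or jQuery is not available, the same HEAD check can be done asynchronously with the built-in XMLHttpRequest: block every click first, then follow the link manually once the check succeeds. This is only a sketch; `checkLinkExists` and `guardLink` are hypothetical names, and it assumes the pages are served by a local web server that answers HEAD requests.

```javascript
// Hypothetical helper: calls onResult(true) if a HEAD request to url succeeds,
// and onResult(false) on an error status or a network failure.
function checkLinkExists(url, onResult) {
    var xhr = new XMLHttpRequest();
    xhr.open('HEAD', url, true); // true = asynchronous
    xhr.onload = function () {
        onResult(xhr.status >= 200 && xhr.status < 400);
    };
    xhr.onerror = function () {
        onResult(false);
    };
    xhr.send();
}

// Hypothetical helper: holds every click on the anchor, checks the target,
// and only navigates when the check reports that the target exists.
// `exists` and `navigate` are passed in so the logic is easy to swap or test.
function guardLink(anchor, exists, navigate) {
    anchor.addEventListener('click', function (evt) {
        evt.preventDefault(); // do nothing until the link is known to be good
        var href = anchor.href;
        exists(href, function (ok) {
            if (ok) {
                navigate(href); // good link: follow it manually
            }
            // broken link: the click stays cancelled
        });
    });
}

// Browser wiring (skipped when no document is present).
if (typeof document !== 'undefined') {
    var allAnchors = document.getElementsByTagName('a');
    for (var k = 0; k < allAnchors.length; k++) {
        guardLink(allAnchors[k], checkLinkExists, function (href) {
            window.location.href = href;
        });
    }
}
```

Because the request is asynchronous, the click is always cancelled up front and navigation happens from the callback, which avoids the problem of trying to return a value out of the ajax handlers.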

Python based solution:

If you don't mind preprocessing the html page (and even caching the result), I would go with parsing the HTML in Python using a library like BeautifulSoup.

Essentially I would find all the links on the page, and replace the href attribute of those starting with http or https with #. You can then use a library like requests to check the internal urls and update the appropriate urls as suggested.

vontell
junnytony
  • I have tried BeautifulSoup as well as lxml, but in the final solution I may not have access to external libraries such as those (sorry for the unclarity; updated question to include that). – vontell Jun 12 '15 at 02:15