1

I am working on a URL whitelist in a browser extension. What I have currently works but I need to check the list in two places and I want to try to make it more efficient so as to reduce the chances of increased page load times.

I have to check the list in two places. The first check is in a page mod with an attached content script which is applied to all sites, the content script is changed if the url is in the whitelist. The second check is in a request observer to send different headers if the url is whitelisted.

I have tried to only check it once and pass the result from the page mod to the requst observer or from the request observer to the page mod, it results in timing issues with either the headers not being correct or the modifictaions to the content script are not applied, when they should be.

Is there a way I can improve on the substring checking code below to make it faster

I have a list of user entered sites which are sorted alphabetically before saving.

For now the format of the list is simple.

example1.com
b.example2.com/some/content.html
c.exampleN.com

and the url could be

http://example1.com/some/site/content.html

I am currently checking the if the url contains a substring with the value of each array element

//check if  a url is in the list
function listCheck(list,url){ 

    for (var i=0; i<list.length; i++){
        if (url.indexOf(list[i]) > -1)
            return true;
    }

   return false;
};
Kuser
  • 13
  • 3

2 Answers2

2
  • You can use binary search with the first letter of the URL. This will come handy because whitelists can grow pretty fast. However you cannot do this with patterns. (e.g.: *.somedomain.com)
  • Consider about using a hashtable to store the whitelist. You can make it efficient and specialized by writing your own hash function.
  • Regex will make things easier, but will also make things slow at times. If you use regex, make sure you know what you are doing. You can shrink the comparison list first by one of the methods described above.

Edit: Here's the binary search I was talking about. This is applicable only if wildcards are not used.

function binarySearch(needle, haystack, startIndex, endIndex) {
    //console.log("\ttrying to find " + needle + " between " +
    //    haystack[startIndex] + "(" + startIndex + ") and " + 
    //    haystack[endIndex] + "(" + endIndex + ")");

    // the basic case, where the list is narrowed down to 1 or 2 items
    if (startIndex == endIndex || endIndex - startIndex == 1) {
        if (haystack[startIndex] == needle)
            return startIndex;
        if (haystack[endIndex] == needle)
            return endIndex;
        return -1;
    }

    var midIndex = Math.ceil((startIndex + endIndex) / 2);
    //console.log("\t\tgot " + haystack[midIndex] + "(" + midIndex +
    //    ") for middle of the list.");

    var comparison = haystack[midIndex].localeCompare(needle);
    //console.log("\t\tcomparison: " + comparison);

    if (comparison > 0)
        return binarySearch(needle, haystack, startIndex, midIndex);
    if (comparison < 0)
        return binarySearch(needle, haystack, midIndex, endIndex);
    return midIndex; // (comparison == 0)
}

var sitelist = [ // the whitelist (the haystack).
        "alpha.com",
        "bravo.com",
        "charlie.com",
        "delta.com",
        "echo.com",
        "foxtrot.com",
        "golf.com",
        "hotel.com",
        "india.com",
        "juliet.com",
        "kilo.com",
        "lima.com",
        "mike.com",
        "november.com",
        "oscar.com",
        "papa.com",
        "quebec.com",
        "romeo.com",
        "sierra.com",
        "tango.com",
        "uniform.com",
        "victor.com",
        "whiskey.com",
        "xray.com",
        "yankee.com",
        "zulu.com"
    ];

function testBinarySearch(needle) {
    console.log("trying to find " + needle);
    var foundIndex = binarySearch(needle, sitelist, 0, sitelist.length - 1);

    if (foundIndex < 0)
        console.log(needle + " not found");
    else
        console.log(needle + " found at: " + foundIndex);
}

// note that the list is already sorted. if the list is not sorted,
// haystack.sort();

// we can find "uniform.com" using 5 comparisons, instead of 20
testBinarySearch("uniform.com");
// we can confirm the non-existance of "google.com" in 4 comparisons, not 26
testBinarySearch("google.com");
// this is an interesting (worst) case, it takes 5 comparisons, instead of 1
testBinarySearch("alpha.com");
// "zulu.com" takes 4 comparisons instead of 26
testBinarySearch("zulu.com");

When your list grows, binary search can scale very well. I would not go to other pros and cons of binary search since they are very well documented in large number of places.

More SO questions on JavaScript binary search:

Community
  • 1
  • 1
sampathsris
  • 21,564
  • 12
  • 71
  • 98
  • Thanks for getting back to me. I will try out the solutions mentioned and accept the one I find that works best for me. – Kuser Aug 01 '14 at 18:01
  • Can you give example of binary search in javascript please. – Blagoh Aug 01 '14 at 20:20
  • @Blagoh: a comment is too small to describe. This is a special scenorio where a list of URLs are sorted. So if we want to find a given URL from the list, instead of looping through every item, we can use binary search. That is, compare the item in the middle of the list with the given URL. And if it does not match, halve the list size and repeat. Check the Wikipedia article I provided for examples. – sampathsris Aug 01 '14 at 20:35
  • @krumia can you edit your solution here to include a javascript example of this halving im very curious about it as well – Noitidart Aug 01 '14 at 23:02
  • 2
    @Blagoh, @Noitidart: I fulfilled your request. remove commented `console.log`s to see how it works. – sampathsris Aug 02 '14 at 04:12
  • Thank you! This snippet is quite interesting. I never gave such a thing a lable as binary search, I learned something new today. tx – Blagoh Aug 04 '14 at 16:35
1

Using a regexp will make things easier. Whit this code you just need to make ONE comparison.

function listCheck(list, url) {
    var exp = new RegExp('(' + list.join('|') + ')');

    if (exp.test(url)) return true;
    else return false;
}

EDIT: you can get errors with symbols . or / or - in urls, so this code works better:

function listCheck(list, url) {
    var exp = new RegExp('(' + list.join('|').replace(/(\/|\.|\-)/g, '\\$1') + ')');

    if (exp.test(url)) return true;
    else return false;
}
Marco Bonelli
  • 63,369
  • 21
  • 118
  • 128
  • 1
    Just to add some clarity to Marco's answer: The vertical bar (or pipe symbol '|') is the regular expression equivalent of "or". The test() method executes a search for a match between a regular expression and a specified string. Returns true or false. – Asher G. Aug 01 '14 at 17:41
  • @Marco and mccallbear Thanks for getting back to me. I will try out the solutions mentioned and accept the one I find that works best for me. – Kuser Aug 01 '14 at 18:01