9

I have a hidden field on my page that stores space separated list of emails. I can have maximum 500 emails in that field.

What will be the fastest way to search if a given email already exists in that list? I need to search multiple emails in a loop

  1. use RegEx to find a match

  2. use indexOf()

  3. convert the list to a javascript dictionary and then search

If this is an exact duplicate, please let me know the other question. Thanks

EDIT: Thanks everyone for your valuable comments and answers. Basically my user has a list of emails(0-500) in db. User is presented with his own contact list. User can then choose one\more emails from his contact list to add to the list. I want to ensure at client side that he is not adding duplicate emails. Whole operation is driven by ajax, so jsvascript is required.

kheya
  • 7,546
  • 20
  • 77
  • 109
  • It would be nice if you test and post the answer here :) – Cem Kalyoncu Jun 12 '11 at 11:52
  • 3
    I would pick the last option. – Gumbo Jun 12 '11 at 11:54
  • @Gumbo: for what its worth, I would expect `indexOf()` to be the fastest. – Tim Down Jun 12 '11 at 11:56
  • Just to keep in mind you can avoid for search the part between '@' and ' ' (which is the delimiter) – eugeneK Jun 12 '11 at 11:58
  • 1
    *"I need to search multiple emails in a loop"* I bet the relationship between how many you need to search for in a loop and the total number you're searching will affect the answer as to which is "fastest." – T.J. Crowder Jun 12 '11 at 12:03
  • When I used `indexOf()` and RegEx in ActionScript3, `indexOf()` was **very** faster than `RegEx`. – JiminP Jun 12 '11 at 12:05
  • Option 4. `split` the string into an array (once) and use `Array#indexOf` (on each loop). – T.J. Crowder Jun 12 '11 at 12:06
  • @TJ, that only works out for you if you're doing some number of searches; otherwise the split takes O(length(emails)) and you still have to do some number of string comparisons -- O(number of emails) -- to find the index. If you're only doing it once, just do one string search and cut out the setup time. – Charlie Martin Jun 12 '11 at 12:37
  • @Charlie: Yeah, agreed. I was just pointing out that there's a fourth option. – T.J. Crowder Jun 12 '11 at 12:39
  • Thanks for all comments and answers. Reviewing all of them now. – kheya Jun 12 '11 at 19:26

6 Answers6

13

The answer is: It depends.

  • It depends on what you actually want to measure.
  • It depends on the relationship between how many you're searching for vs. how many you're searching.
  • It depends on the JavaScript implementation. Different implementations usually have radically different performance characteristics. This is one of the many reasons why the rule "Don't optimize prematurely" applies especially to cross-implementation JavaScript.

...but provided you're looking for a lot fewer than you have in total, it's probably String#indexOf unless you can create the dictionary once and reuse it (not just this one loop of looking for X entries, but every loop looking for X entries, which I tend to doubt is your use-case), in which case that's hands-down faster to build the 500-key dictionary and use that.

I put together a test case on jsperf comparing the results of looking for five strings buried in a string containing 500 space-delimited, unique entries. Note that that jsperf page compares some apples and oranges (cases where we can ignore setup and what kind of setup we're ignoring), but jsperf was being a pain about splitting it and I decided to leave that as an exercise for the reader.

In my tests of what I actually think you're doing, Chrome, Firefox, IE6, IE7 and IE9 did String#indexOf fastest. Opera did RegExp alternation fastest. (Note that IE6 and IE7 don't have Array#indexOf; the others do.) If you can ignore dictionary setup time, then using a dictionary is the hands-down winner.

Here's the prep code:

// ==== Main Setup
var toFind = ["aaaaa100@zzzzz", "aaaaa200@zzzzz", "aaaaa300@zzzzz", "aaaaa400@zzzzz", "aaaaa500@zzzzz"];
var theString = (function() {
 var m, n;

 m = [];
 for (n = 1; n <= 500; ++n) {
  m.push("aaaaa" + n + "@zzzzz");
 }
 return m.join(" ");
})();

// ==== String#indexOf (and RegExp) setup for when we can ignore setup
var preppedString = " " + theString + " ";

// ==== RegExp setup for test case ignoring RegExp setup time
var theRegExp = new RegExp(" (?:" + toFind.join("|") + ") ", "g");

// ==== Dictionary setup for test case ignoring Dictionary setup time
var theDictionary = (function() {
 var dict = {};
 var index;
 var values = theString.split(" ");
 for (index = 0; index < values.length; ++index) {
  dict[values[index]] = true;
 }
 return dict;
})();

// ==== Array setup time for test cases where we ignore array setup time
var theArray = theString.split(" ");

The String#indexOf test:

var index;
for (index = 0; index < toFind.length; ++index) {
 if (theString.indexOf(toFind[index]) < 0) {
  throw "Error";
 }
}

The String#indexOf (ignore setup) test, in which we ignore the (small) overhead of putting spaces at either end of the big string:

var index;
for (index = 0; index < toFind.length; ++index) {
 if (preppedString.indexOf(toFind[index]) < 0) {
  throw "Error";
 }
}

The RegExp alternation test:

// Note: In real life, you'd have to escape the values from toFind
// to make sure they didn't have special regexp chars in them
var regexp = new RegExp(" (?:" + toFind.join("|") + ") ", "g");
var match, counter = 0;
var str = " " + theString + " ";
for (match = regexp.exec(str); match; match = regexp.exec(str)) {
 ++counter;
}
if (counter != 5) {
 throw "Error";
}

The RegExp alternation (ignore setup) test, where we ignore the time it takes to set up the RegExp object and putting spaces at either end of the big string (I don't think this applies to your situation, the addresses you're looking for would be static):

var match, counter = 0;
for (match = theRegExp.exec(preppedString); match; match = theRegExp.exec(preppedString)) {
 ++counter;
}
if (counter != 5) {
 throw "Error";
}

The Dictionary test:

var dict = {};
var index;
var values = theString.split(" ");
for (index = 0; index < values.length; ++index) {
 dict[values[index]] = true;
}
for (index = 0; index < toFind.length; ++index) {
 if (!(toFind[index] in dict)) {
  throw "Error";
 }
}

The Dictionary (ignore setup) test, where we don't worry about the setup time for the dictionary; note that this is different than the RegExp alternation (ignore setup) test because it assumes the overall list is invariant:

var index;
for (index = 0; index < toFind.length; ++index) {
 if (!(toFind[index] in theDictionary)) {
  throw "Error";
 }
}

The Array#indexOf test (note that some very old implementations of JavaScript may not have Array#indexOf):

var values = theString.split(" ");
var index;
for (index = 0; index < toFind.length; ++index) {
 if (values.indexOf(toFind[index]) < 0) {
  throw "Error";
 }
}

The Array#indexOf (ignore setup) test, which like Dictionary (ignore setup) assumes the overall list is invariant:

var index;
for (index = 0; index < toFind.length; ++index) {
 if (theArray.indexOf(toFind[index]) < 0) {
  throw "Error";
 }
}
T.J. Crowder
  • 1,031,962
  • 187
  • 1,923
  • 1,875
  • Seems like for the dictionary test, the work of putting the strings into the dictionary might not count; if the set of strings is fairly static, that's a one-time operation independent of the search process. – Pointy Jun 12 '11 at 12:38
  • Almost a winner. From the description, though, it seems like he might be looking up one email once; it'd be useful to compare single searches to see where the trade for setup time happens. – Charlie Martin Jun 12 '11 at 12:40
  • @Pointy, that's the same point I made in my answer to start with, and it's the core question people fail to ask in most "performance" problems: what's the workload? One setup with 500 searches will have a different optimal answer than (setup,search)+. – Charlie Martin Jun 12 '11 at 12:42
  • @Pointy, @Charlie: Yeah, my understanding was that he wanted to know what the fastest *overall* operation was, but if setup time isn't an issue and he wants to know what's fastest after setup, that would be different. – T.J. Crowder Jun 12 '11 at 12:42
  • @Pointy, @Charlie: Added some "ignore setup" test cases as well. Also switched to using `in` to test the dictionary, *slightly* more efficient and safe with the data set the question postulates. – T.J. Crowder Jun 12 '11 at 13:01
  • Your first example fails when the address is just a substring like `bar@example.com` is a substring of `foobar@example.com`. The same applies for the regular expression example, although is requires a word boundary like `/\bbar@example\.com\b/.test("foo-bar@example.com")` (not to mention that the addresses need to be quoted properly). – Gumbo Jun 12 '11 at 13:10
  • 1
    Fixed substring search and went another way on regexp. And now I really must go do something else. :) – T.J. Crowder Jun 12 '11 at 13:24
  • +1 for the jsperf, I'd love to see how this does on Node – qodeninja Oct 07 '14 at 01:11
5

Instead of looking for the fastest solution, you first need to make sure that you’re actually having a correct solution. Because there are four cases an e-mail address can appear and a naive search can fail:

  1. Alone: user@example.com
  2. At the begin: user@example.com ...
  3. At the end: ... user@example.com
  4. In between: ... user@example.com ...

Now let’s analyze each variant:

  1. To allow arbitrary input, you will need to escape the input properly. You can use the following method to do so:

    RegExp.quote = function(str) {
        return str.toString().replace(/(?=[.?*+^$[\]\\(){}-])/g, "\\");
    };
    

    To match all four cases, you can use the following pattern:

    /(?:^|\ )user@example\.com(?![^\ ])/
    

    Thus:

    var inList = new RegExp("(?:^| )" + RegExp.quote(needle) + "(?![^ ])").test(haystack);
    
  2. Using indexOf is a little more complex as you need to check the boundaries manually:

    var pos = haystack.indexOf(needle);
    if (pos != -1 && (pos != 0 && haystack.charAt(pos-1) !== " " || haystack.length < (pos+needle.length) && haystack.charAt(pos+needle.length) !== " ")) {
        pos = -1;
    }
    var inList = pos != -1;
    
  3. This one is rather quite simple:

    var dict = {};
    haystack.match(/[^\ ]+/g).map(function(match) { dict[match] = true; });
    var inList = dict.hasOwnProperty(haystack);
    

Now to test what variant is the fastest, you can do that at jsPerf.

Aurovrata
  • 2,000
  • 27
  • 45
Gumbo
  • 643,351
  • 109
  • 780
  • 844
1

indexOf() is most probably the fastest just keep in mind you need to search for two possible cases:

var existingEmails = "email1, email2, ...";
var newEmail = "somethingHere@email.com";
var exists = (existingEmails.indexOf(newEmail + " ") >= 0) || (existingEmails.indexOf(" " + newEmail ) > 0);
Shadow The GPT Wizard
  • 66,030
  • 26
  • 140
  • 208
  • Existing email is space separated. why not add ' ' to the end of new Email and existing emails. That way we can just search for one case only. Your code not using newEmail – kheya Jun 12 '11 at 19:49
  • Because what if the email is first or last? Regarding `newEmail` good catch, my bug.. – Shadow The GPT Wizard Jun 12 '11 at 19:54
  • How about searching ' ' + newEmail + ' ' to get rid of one of the indexOf call? Existing emails would need to have ' ' at beginning and end – kheya Jun 12 '11 at 20:45
0

I know this is an old question, but here goes an answer for those who might need in the future.

I made some tests and the indexOf() method is impossibly fast!

Tested the case on Opera 12.16 and it took 216µs to search and possibly find something.

Here is the code used:

console.time('a');
var a=((Math.random()*1e8)>>0).toString(16);
for(var i=0;i<1000;++i)a=a+' '+((Math.random()*1e8)>>0).toString(16)+((Math.random()*1e8)>>0).toString(16)+((Math.random()*1e8)>>0).toString(16)+((Math.random()*1e8)>>0).toString(16);
console.timeEnd('a');
console.time('b');
var b=(' '+a).indexOf(((Math.random()*1e8)>>0).toString(16));
console.timeEnd('b');
console.log([a,b]);

In the console you will see a huge output.

The timer 'a' counts the time taken to make the "garbage", and the timer 'b' is the time to search for the string.

Just adding 2 spaces, one before and one after, on the email list and adding 1 space before and after the email, you are set to go.

I use it to search for a class in an element without jQuery and it works pretty fast and fine.

Ismael Miguel
  • 4,185
  • 1
  • 31
  • 42
0

You're asking a question with too many unstated variables for us to answer. For example, how many times do you expect to perform this search? only once? A hundred times? Is this a fixed list of emails, or does it change every time? Are you loading the emails with the page, or by AJAX?

IF you are performing more than one search, or the emails are loaded with the page, then you are probably best off creating a dictionary of the names, and using the Javascript in operator.

If you get the string from some off-page source, and you only search it once, then indexOf may well be better.

In all cases, if you really care about the speed, you're best off making a test.

But then I'd ask "Why do you care about the speed?" This is a web page, where loading the page happens at network speeds; the search happens at more or less local-processor speed. It's very unlikely that this one search will make a perceptible difference in the behavior of the page.

Charlie Martin
  • 110,348
  • 25
  • 193
  • 263
  • The `in` operator is not a good idea: `"indexOf" in []` yields true. Better use `hasOwnProperty`. – Gumbo Jun 12 '11 at 12:17
  • Yes, @Gumbo, if you completely violate the assumptions of the questioner AND my answer, `in` doesn't work. In fact, there is a countable infinity of different programs that don't do what I said that won't work. – Charlie Martin Jun 12 '11 at 12:32
  • @Gumbo: The question states these are email addresses. There are no default properties on `Object.prototype` containing the `@` character, so `in` would be fine. – T.J. Crowder Jun 12 '11 at 12:40
0

Here is a little explanation:

Performing a dictionary lookup is relatively complicated - very fast compared with (say) a linear lookup by key when there are lots of keys, but much more complicated than a straight array lookup. It has to calculate the hash of the key, then work out which bucket that should be in, possibly deal with duplicate hashes (or duplicate buckets) and then check for equality.

As always, choose the right data structure for the job - and if you really can get away with just indexing into an array (or List) then yes, that will be blindingly fast.

The above has been taken from one of the blog posts of @Jon Skeet.

Hasan Fahim
  • 3,875
  • 1
  • 30
  • 51