9

I need to write a function which tests, if given string is "blank" in a sense that it only contains whitespace characters. Whitespace characters are the following:

'\u0009',
'\u000A',
'\u000B',
'\u000C',
'\u000D',
' ',
'\u0085',
'\u00A0',
'\u1680',
'\u180E',
'\u2000',
'\u2001',
'\u2002',
'\u2003',
'\u2004',
'\u2005',
'\u2006',
'\u2007',
'\u2008',
'\u2009',
'\u200A',
'\u2028',
'\u2029',
'\u202F',
'\u205F',
'\u3000'

The function will be called a lot of times, so it must be really, really performant. But shouldn't take too much memory (like mapping every character to true/false in an array). Things I've tried out so far:

  • regexp - not quite performant
  • trim and check if length is 0 - not quite performant, also uses additional memory to hold the trimmed string
  • checking every string character against a hash set containing whitespace characters (if (!whitespaceCharactersMap[str[index]]) ...) - works well enough
  • my current solution uses hardcoded comparisons:

    function(str) {
        var length = str.length;
        if (!length) {
            return true;
        }
        for (var index = 0; index < length; index++)
        {
            var c = str[index];
            if (c === ' ')
            {
                // skip
            }
            else if (c > '\u000D' && c < '\u0085')
            {
                return false;
            }
            else if (c < '\u00A0')
            {
                if (c < '\u0009')
                {
                    return false;
                }
                else if (c > '\u0085')
                {
                    return false;
                }
            }
            else if (c > '\u00A0')
            {
                if (c < '\u2028')
                {
                    if (c < '\u180E')
                    {
                        if (c < '\u1680')
                        {
                            return false;
                        }
                        else if(c > '\u1680')
                        {
                            return false;
                        }
                    }
                    else if (c > '\u180E')
                    {
                        if (c < '\u2000')
                        {
                            return false;
                        }
                        else if (c > '\u200A')
                        {
                            return false;
                        }
                    }
                }
                else if (c > '\u2029')
                {
                    if (c < '\u205F')
                    {
                        if (c < '\u202F')
                        {
                            return false;
                        }
                        else if (c > '\u202F')
                        {
                            return false;
                        }
                    }
                    else if (c > '\u205F')
                    {
                        if (c < '\u3000')
                        {
                            return false;
                        }
                        else if (c > '\u3000')
                        {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }
    

This seems to work 50-100% faster than hash set (tested on Chrome).

Does anybody see or know further options?

Update 1

I'll answer some of the comments here:

  • It's not just checking user input for emptyness. I have to parse certain data format where whitespace must be handled separately.
  • It is worth optimizing. I've profiled the code before. Checking for blank strings seems to be an issue. And, as we saw, the difference in performance between approaches can be up to 10 times, it's definitely worth the effort.
  • Generally, I find this "hash set vs. regex vs. switch vs. branching" challenge very educating.
  • I need the same functionality for browsers as well as node.js.

Now here's my take on performance tests:

http://jsperf.com/hash-with-comparisons/6

I'd be grateful if you guys run these tests a couple of times.

Preliminary conclusions:

  • branchlessTest (a^9*a^10*a^11...) is extremely fast in Chrome and Firefox, but not in Safari. Probably the best choice for Node.js from performance perspective.
  • switchTest is also quite fast on Chrom and Firefox, but, surprizingly the slowest in Safari and Opera
  • Regexps with re.test(str) perform well everywhere, even fastest in Opera.
  • Hash and branching show almost identically poor results almost everywhere. Comparision is also similar, often worst performance (this may be due to the implementation, check for ' ' should be the first one).

To sum up, for my case I'll opt to the following regexp version:

var re = /[^\s]/;
return !re.test(str);

Reasons:

  • branchless version is cool in Chrome and Firefox but isn't quite portable
  • switch is too slow in Safari
  • regexps seem to perform well everywhere, they'll also very compact in code
lexicore
  • 42,748
  • 17
  • 132
  • 221
  • I don't know much about js...how about a regular expression may be? – pinkpanther Jun 09 '13 at 13:51
  • 7
    You can spend a lot of time optimizing things which are not worth it. Simple test `string.match(/^\s*$/')` will do the job perfectly in [very short time](http://jsfiddle.net/RxrBA/)... – Tomas Jun 09 '13 at 13:57
  • 2
    You can also use `if (line.trim().isEmpty()) { // line is "blank" }` from http://stackoverflow.com/questions/3012788/how-to-check-if-a-line-is-blank-using-regex – pinkpanther Jun 09 '13 at 13:57
  • trim() is not good since it creates a trimmed version of string. Which I don't need, even temporarily. – lexicore Jun 09 '13 at 14:02
  • 4
    Can you give the source of `50-100% faster`? My experiments show [the opposite](http://jsperf.com/hash-with-comparisons). You didn't recreate that hash with each function call, right? – raina77ow Jun 09 '13 at 14:06
  • 2
    @Tomas If he wants it hyperoptimized, regex is hardly the way. Even then, `var re = /^\s*$/;` somewhere in the main body and then `string.match(re)` where he needs it will be significantly better, as it doesn't recreate the regex each time it is called. – Vedran Šego Jun 09 '13 at 14:07
  • According to Unicode 6.2 and ECMAScript 5 implementation of regular expressions, only `\uFEFF` is detected when it shouldn't, and `\u0085` not detected when it should be. – Qantas 94 Heavy Jun 09 '13 at 14:25
  • It doesn't have some of the test cases you've tried yet, but plugging these into [jsPerf - Check if a string contains only whitespace](http://jsperf.com/check-if-a-string-contains-only-whitespace/2) will help others check your algorithms and test cases. It'll also help anyone else coming along for a similar issue. – dc5 Jun 09 '13 at 14:27
  • 1
    Are you really expecting users to hunt down obscure whitespace characters to fill in your fields and thus put your utility to work?... Or are you really just checking if a field has content this passing a *required* field check? This feels like you're trying to solve a problem the hard way just to cover that 0.001% corner case where someone is trying to stuff garbage in. My guess is that it would be easier to clean this up later with a quick and simple SQL query after the fact... Every 6 months or so. – scunliffe Jun 09 '13 at 14:33
  • 1
    @scunliffe +1, very practical advice. What interests me is the reasoning behind extra requirements for this particular function to be speedy: it will be called _a lot of times_. But if one deals with, like, zounds of _different_ strings (in a single page context), something tells me this particular method is the last potential bottleneck. – raina77ow Jun 09 '13 at 14:41
  • @Tomas: Though, caching the regex and using `test` is considerably faster: http://jsfiddle.net/RxrBA/1/ – Bergi Jun 09 '13 at 15:14

3 Answers3

7

Hard-coded solution seems the best, but I think switch should be faster. It depends on the way JavaScript interpreter handles these (most compilers do this very efficiently), so it may be browser-specific (i.e., fast in some, slow in others). Also, I'm not sure how fast JavaScript is with UTF-strings, so you might try converting a character to its integer code before comparing the values.

for (var index = 0; index < length; index++)
{
    var c = str.charCodeAt(index);
    switch (c) {
        case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D: case 0x0020:
        case 0x0085: case 0x00A0: case 0x1680: case 0x180E: case 0x2000: case 0x2001:
        case 0x2002: case 0x2003: case 0x2004: case 0x2005: case 0x2006: case 0x2007:
        case 0x2008: case 0x2009: case 0x200A: case 0x2028: case 0x2029: case 0x202F:
        case 0x205F: case 0x3000: continue;
    }
    return false;
}

Another thing to consider is changing for:

for (var index in str)
{
    ...
}

Edit

Your jsPerf test got some revisions, the current one available here. My code is significantly faster in Chrome 26 and 27, and in IE10, but it's also the slowest one in Firefox 18.

I ran the same test (I don't know how to make jsPerf save those) on Firefox 20.0 on 64-bit Linux and it turned out to be one of the two fastest ones (tied with trimTest, both at about 11.8M ops/sec). I also tested Firefox 20.0.1 on WinXP, but under a VirtualBox (still under 64bit Linux, which might make a significant difference here), which gave 10M ops/sec to switchTest, with trimTest coming second at 7.3M ops/sec.

So, I'm guessing that the performance depends on the browser version and/or maybe even on the underlying OS/hardware (I suppose the above FF18 test was on Win). In any case, to make a truly optimal version, you'll have to make many versions, test each on all browsers, OSes, architectures,... you can get a hold of, and then include in your page the version best suited for the visitor's browser, OS, architecture,... I'm not sure what kind of code is worth the trouble, though.

Vedran Šego
  • 3,553
  • 3
  • 27
  • 40
5

Since branching is much more expensive than most other operations, you want to keep branches to a minimum. Thus, your sequence of if/else statements may not be very performant. A method which instead uses mostly math would be a lot faster. For example:

One way of performing an equality check without using any branching is to use bitwise operations. One example is, to check that a == b:

a ^ b == 0

Since the xor of two similar bits (ie, 1 ^ 1 or 0 ^ 0) is 0, xor-ing two equal values produces 0. This is useful because it allows us to treat 0 as a "true" value, and do more math. Imagine that we have a bunch of boolean variables represented in this way: nonzero numbers are false, and zero means true. If we want to ask, "is any of these true?" we simply multiply them all together. If any of them were true (equal to zero), the entire result would be zero.

So, for example, the code would look something like this:

function(str) {
    for (var i = 0; i < str.length; i++) {
        var c = str[i];
        if ((c ^ '\u0009') * (c ^ '\u000A') * (c ^ '\u000B') ... == 0)
            continue;
        return false;
    }
    return true;
}

The primary reason that this would be more performant than simply doing something like:

if ((c == '\u0009') || (c == '\u000A') || (c == '\u0008') ...)

is that JavaScript has short-circuit boolean operators, meaning that every time the || operator is used, it not only performs the or operation, but also checks to see if it can prove that the statement must be true thus far, which is a branching operation, which is expensive. The math approach, on the other hand, involves no branching, except for the if statement itself, and should thus be much faster.

joshlf
  • 21,822
  • 11
  • 69
  • 96
  • 2
    It sounds quite nice; [practice](http://jsperf.com/hash-with-comparisons) is far less convincing. – raina77ow Jun 09 '13 at 14:26
  • You mean the stuff about short-circuited logic and branching being expensive? Yes. It may be a bit more complex in a high-level language like JS, but it's definitely true of code running directly on the processor (with assembly-language comparison and conditional jump instructions). I imagine that it's probably true of JS in a compiled context as well, and Chrome compiles its JS, so probably. It's certainly not any slower. – joshlf Jun 09 '13 at 14:27
  • It IS slower. Making assumptions about how JIT languages work based on processor logic won't take you far. – raina77ow Jun 09 '13 at 14:29
  • Huh, interesting, @raina77ow; I wouldn't have expected that. I wonder if it's a compiler optimization, or perhaps the short-circuiting is saving enough operations to make it worthwhile. Or maybe it does branching in either case and isn't close enough to the hardware to make a difference. – joshlf Jun 09 '13 at 14:32
  • "Yes. It may be a bit more complex in a high-level language like JS, but it's definitely true of code running directly on the processor (with assembly-language comparison and conditional jump instructions). I imagine that it's probably true of JS in a compiled context as well, and Chrome compiles its JS, so probably." Your imagination and guesses are not backed by expertise. -1 for posting a detailed analysis in the wrong language and not justifying why it applies here, beyond guessing. – djechlin Jun 09 '13 at 15:28
  • 2
    The theory is right and this will perform by far the fastest [if `Math.imul` is added](https://code.google.com/p/v8/issues/detail?id=2455). The problem for now is that `*` cannot do integer multiplication properly in this case. Here's incorrect code that performs well (by far the fastest infact) because there is no overhead of constantly converting between integers and doubles http://jsperf.com/hash-with-comparisons/5. The only difference is that v8 and spidermonkey can infer integer multiplication correctly in this situation. – Esailija Jun 10 '13 at 07:34
  • 2
    If JavaScript has bit xor (^), surely it has bit and (&?). So, "and" the xor-ed elements together rather than multiplying. – Ira Baxter Jun 15 '13 at 21:45
  • 1
    Good thinking, @IraBaxter. One modification, though: You'd want to or (|) the elements together. True is represented by 0, and false by 1, so to see if any of them is false, or them all together, and you'll see if there was a 1 bit anywhere in the chain. – joshlf Jun 19 '13 at 20:12
  • 1
    Yep, you were right, @IraBaxter. I just ran the test with bitwise-or, and it went from ~20k ops/sec to ~22k ops/sec. Right on :). http://jsperf.com/hash-with-comparisons/8 – joshlf Jun 19 '13 at 20:20
  • Glad to help, in a backwards kind of way :-} – Ira Baxter Jun 19 '13 at 22:02
0

This creates and uses a 'hash' lookup on the characters of the string, if it detects a non-whitespace then returns false:

var wsList=['\u0009','\u000A','\u000B','\u000C','\u000D',' ','\u0085','\u00A0','\u1680','\u180E','\u2000','\u2001','\u2002','\u2003','\u2004','\u2005','\u2006','\u2007','\u2008','\u2009','\u200A','\u2028','\u2029','\u202F','\u205F','\u3000'];
var ws=Object.create(null);
wsList.forEach(function(char){ws[char]=true});
function isWhitespace(txt){
    for(var i=0, l=txt.length; i<l; ++i){
        if(!ws[txt[i]])return false;
    }
    return true;
}

var test1=" \u1680 \u000B \u2002 \u2004";
isWhitespace(test1);
/*
true
*/
var test2=" _ . a ";
isWhitespace(test2);
/*
false
*/

Not sure about it's performance (yet). After a quick test on jsperf, it turns out to be quite slow compared to RegExp using /^\s*$/.


edit:

It appears that the solution you should go with might likely depend on the nature of the data you are working with: Is the data mostly whitespace or mostly non-whitespace? Also mostly ascii-range text? You might be able to speed it up for average test cases by using range checks (via if) for common non-whitespace character ranges, using switch on the most common whitespace, then using a hash lookup for everything else. This will likely improve average performance of the tests if most of the data being tested is comprised of the most common characters (between 0x0--0x7F).

Maybe something like this (a hybrid of if/switch/hash) could work:

/*same setup as above with variable ws being a hash lookup*/
function isWhitespaceHybrid(txt){
    for(var i=0, l=txt.length; i<l; ++i){
        var cc=txt.charCodeAt(i)
        //above space, below DEL
        if(cc>0x20 && cc<0x7F)return false;
        //switch only the most common whitespace
        switch(cc){
            case 0x20:
            case 0x9:
            case 0xA:
            case 0xD:
            continue;
        }
        //everything else use a somewhat slow hash lookup (execute for non-ascii range text)
        if(!ws[txt[i]])return false;
    }
    return true;
}
jongo45
  • 3,020
  • 2
  • 16
  • 12
  • Why would you belive a switch statement is faster than an array index operation? I'd be really surprised if your "hybrid" was faster than your original lookup table. I'd also be tempted to make the lookup table dense, that is, fill in *all* entries with true or false. The JS interpreter may be implementing the array behind the scenes with a discrimination net. (It may be using a hash table, too, at which point your solution is probably best). – Ira Baxter Jun 15 '13 at 21:11
  • @Ira Baxter. I think you are correct about `indexOf` being much faster. Switch statements have quite a bit more operations going on than most people realize in order to adhere to standards: ([ecmascript:switch statement](http://www.ecma-international.org/ecma-262/5.1/#sec-12.11)). Regarding the hash table, it's still a relatively expensive operation, but for normal ascii-range text the hash lookup shouldn't occur (hence the `if` and `switch` were use to capture the majority of cases). – jongo45 Jun 15 '13 at 21:38
  • "indexOf"? I said nothing about indexOf. I simply expected a direct lookup in a dense array to pretty fast, and it contains no branches (or other implied conversions/shenanigans on the part of switch). – Ira Baxter Jun 15 '13 at 21:49
  • @Ira Baxter That's the catch, If I'm not mistaken, getting an element from an array via index isn't any different than getting a property from an object via key as far as performance is concerned (might depend on browser). It's a 'feature' of javascript. – jongo45 Jun 15 '13 at 22:22
  • PHP was like that, too, until they (long ago) figured out that array indexing was worth doing right for performance reasons. In the grand scheme of things, this is a pretty small change to the Javascript interpreter/jitter with quite a big payoff, esp. considering growing interest in Javascript performance these days. Are you *sure* that array index on *dense* arrays is done by doing property lookup (e.g., "hashing") on the array index value? – Ira Baxter Jun 15 '13 at 22:36