I need to write a function that tests whether a given string is "blank", in the sense that it contains only whitespace characters. The whitespace characters are the following:
'\u0009',
'\u000A',
'\u000B',
'\u000C',
'\u000D',
' ',
'\u0085',
'\u00A0',
'\u1680',
'\u180E',
'\u2000',
'\u2001',
'\u2002',
'\u2003',
'\u2004',
'\u2005',
'\u2006',
'\u2007',
'\u2008',
'\u2009',
'\u200A',
'\u2028',
'\u2029',
'\u202F',
'\u205F',
'\u3000'
The function will be called many times, so it must be really, really performant, but it shouldn't take too much memory (for example, by mapping every character to true/false in an array). Things I've tried so far:
- regexp - not quite performant
- trim and check if length is 0 - not quite performant, also uses additional memory to hold the trimmed string
- checking every string character against a hash set containing whitespace characters (if (!whitespaceCharactersMap[str[index]]) ...) - works well enough
My current solution uses hardcoded comparisons:
function isBlank(str) {
  var length = str.length;
  if (!length) {
    return true;
  }
  for (var index = 0; index < length; index++) {
    var c = str[index];
    if (c === ' ') {
      // skip: the most common whitespace character is checked first
    } else if (c > '\u000D' && c < '\u0085') {
      return false;
    } else if (c < '\u00A0') {
      // only \u0009-\u000D and \u0085 may pass here
      if (c < '\u0009') {
        return false;
      } else if (c > '\u0085') {
        return false;
      }
    } else if (c > '\u00A0') {
      if (c < '\u2028') {
        if (c < '\u180E') {
          // only \u1680 may pass here
          if (c < '\u1680') {
            return false;
          } else if (c > '\u1680') {
            return false;
          }
        } else if (c > '\u180E') {
          // only \u2000-\u200A may pass here
          if (c < '\u2000') {
            return false;
          } else if (c > '\u200A') {
            return false;
          }
        }
      } else if (c > '\u2029') {
        if (c < '\u205F') {
          // only \u202F may pass here
          if (c < '\u202F') {
            return false;
          } else if (c > '\u202F') {
            return false;
          }
        } else if (c > '\u205F') {
          // only \u3000 may pass here
          if (c < '\u3000') {
            return false;
          } else if (c > '\u3000') {
            return false;
          }
        }
      }
    }
  }
  return true;
}
This seems to work 50-100% faster than the hash set (tested on Chrome).
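For comparison, the hash-set variant mentioned in the list above could be sketched roughly as follows. Only the name whitespaceCharactersMap comes from the snippet in the list; the function name isBlankByLookup and the way the map is built are assumptions made for illustration.

```javascript
// The 26 whitespace characters listed at the top of the question.
var WHITESPACE_CHARACTERS = [
  '\u0009', '\u000A', '\u000B', '\u000C', '\u000D', ' ', '\u0085', '\u00A0',
  '\u1680', '\u180E', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004',
  '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200A', '\u2028',
  '\u2029', '\u202F', '\u205F', '\u3000'
];

// A prototype-less object used as a hash set: character -> true.
var whitespaceCharactersMap = Object.create(null);
WHITESPACE_CHARACTERS.forEach(function (ch) {
  whitespaceCharactersMap[ch] = true;
});

// Hypothetical name: returns true when every character of str is in the set.
function isBlankByLookup(str) {
  for (var index = 0; index < str.length; index++) {
    if (!whitespaceCharactersMap[str[index]]) {
      return false;
    }
  }
  return true;
}
```

The map holds only 26 entries, so memory use stays small; the cost is one property lookup per character.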
Does anybody see or know of further options?
Update 1
I'll answer some of the comments here:
- It's not just checking user input for emptiness. I have to parse a certain data format where whitespace must be handled separately.
- It is worth optimizing. I've profiled the code before, and checking for blank strings seems to be an issue. And, as we saw, the difference in performance between approaches can be up to tenfold, so it's definitely worth the effort.
- Generally, I find this "hash set vs. regex vs. switch vs. branching" challenge very educational.
- I need the same functionality for browsers as well as Node.js.
Now here's my take on performance tests:
http://jsperf.com/hash-with-comparisons/6
I'd be grateful if you guys run these tests a couple of times.
Preliminary conclusions:
- branchlessTest (a^9*a^10*a^11...) is extremely fast in Chrome and Firefox, but not in Safari. It is probably the best choice for Node.js from a performance perspective.
- switchTest is also quite fast in Chrome and Firefox but, surprisingly, the slowest in Safari and Opera.
- Regexps with re.test(str) perform well everywhere and are even the fastest in Opera.
- Hash and branching show almost identically poor results almost everywhere. Comparison is similar and often performs worst (this may be due to the implementation: the check for ' ' should come first).
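The branchless idea rests on the fact that a ^ k is zero only when a equals k, so a product of such terms is zero exactly when a matches one of the codes. The jsperf code is not reproduced here, so the following is only a sketch under that assumption; isWhitespaceCode and isBlankByXorProduct are hypothetical names. Note that in JavaScript * binds tighter than ^, so each XOR needs its own parentheses.

```javascript
// Hypothetical helper: true when the UTF-16 code unit `a` is one of the
// 26 whitespace codes listed in the question.
function isWhitespaceCode(a) {
  // Each factor (a ^ k) is 0 only when a === k, so the product is 0
  // exactly when `a` equals one of the listed codes. All non-zero factors
  // are positive integers, so the product cannot collapse to 0 otherwise.
  return (
    (a ^ 0x0009) * (a ^ 0x000A) * (a ^ 0x000B) * (a ^ 0x000C) *
    (a ^ 0x000D) * (a ^ 0x0020) * (a ^ 0x0085) * (a ^ 0x00A0) *
    (a ^ 0x1680) * (a ^ 0x180E) * (a ^ 0x2000) * (a ^ 0x2001) *
    (a ^ 0x2002) * (a ^ 0x2003) * (a ^ 0x2004) * (a ^ 0x2005) *
    (a ^ 0x2006) * (a ^ 0x2007) * (a ^ 0x2008) * (a ^ 0x2009) *
    (a ^ 0x200A) * (a ^ 0x2028) * (a ^ 0x2029) * (a ^ 0x202F) *
    (a ^ 0x205F) * (a ^ 0x3000)
  ) === 0;
}

// The per-character test has no branches; the loop over the string still does.
function isBlankByXorProduct(str) {
  for (var index = 0; index < str.length; index++) {
    if (!isWhitespaceCode(str.charCodeAt(index))) {
      return false;
    }
  }
  return true;
}
```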
To sum up, for my case I'll opt for the following regexp version:
var re = /[^\s]/;
return !re.test(str);
Reasons:
- branchless version is cool in Chrome and Firefox but isn't quite portable
- switch is too slow in Safari
- regexps seem to perform well everywhere, and they're also very compact in code
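One caveat worth checking with the regexp version: per the ECMAScript specification, \s is not identical to the character list at the top of the question. It does not match '\u0085', it does match '\uFEFF', and whether it matches '\u180E' depends on the Unicode version the engine implements. If the exact list matters, an explicit character class is one option. This is a sketch only; nonBlankRe and isBlankExact are made-up names.

```javascript
// Matches any character that is NOT one of the 26 listed whitespace characters.
// \u0009-\u000D and \u2000-\u200A are written as ranges; the rest are single code units.
var nonBlankRe = /[^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]/;

// Hypothetical name: true when str contains only characters from the list.
function isBlankExact(str) {
  return !nonBlankRe.test(str);
}
```

With this class, isBlankExact('\u0085') returns true, whereas !/[^\s]/.test('\u0085') returns false.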