I would like to remove all non-alphanumeric characters and replace spaces with underscores. So far I've come up with this using multiple regexes, which works, but is there a more 'efficient' way?

"Well Done!".toLowerCase().replace(/\s/g, '-').replace(/[^\w-]/g, '');

well-done

experimenter
  • You don't need the `toLowerCase()`, and you mean dashes `-` as opposed to underscores `_`? – Jerry Aug 20 '13 at 12:58
  • You can use a function as the second parameter to decide what the replacement will be for any given match: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#Specifying_a_function_as_a_parameter. This will avoid going through the string twice. Whether this is more or less efficient than many JS function invocations for short strings is questionable. (My instinct tells me "nope", but I can't be arsed to make a jsperf.) – millimoose Aug 20 '13 at 13:00
  • Anyway, your code works so I'm not sure this is an entirely appropriate question. Like, what would make an answer "correct"? Besides it being "different", or coming in first, or that you like it for some reason. – millimoose Aug 20 '13 at 13:05
  • @millimoose it seems using the function parameter is a good option, what I was really checking was whether there was some more 'intelligent' regex that would let me do both :) – experimenter Aug 20 '13 at 13:36
  • *almost* a duplicate of [How to convert a Title to a URL slug in jQuery?](http://stackoverflow.com/q/1053902/7586) – Kobi Aug 20 '13 at 13:41
  • @htmlr It's a *different* option. I think in your case it's worse for both readability and performance. You're doing two different things, it makes enough sense to do two different calls. – millimoose Aug 20 '13 at 14:13

2 Answers

At least in other languages, invoking the regular expressions engine is expensive. I'm not sure if that's true of JavaScript, but here's how you'd do it "C-style". I'm sure benchmarking its performance yourself will be a valuable learning experience.

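var x = "Well Done!";
var y = "";
var c;
for (var i = 0; i < x.length; i++)
{
    c = x.charCodeAt(i);
    // Digits (48-57) and lowercase letters (97-122) are kept as-is.
    if ((c >= 48 && c <= 57) || (c >= 97 && c <= 122))
    {
        y += x[i];
    }
    // Uppercase letters (65-90) are lowercased by adding 32.
    else if (c >= 65 && c <= 90)
    {
        y += String.fromCharCode(c + 32);
    }
    // Space (32) and \t, \n, \v, \f, \r (9-13) become a dash.
    else if (c === 32 || (c >= 9 && c <= 13))
    {
        y += '-';
    }
    // Every other character is dropped.
}
// Assumes jQuery is loaded and the page has an element with id "output".
$('#output').html(y);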

See http://www.asciitable.com/ for ASCII codes. Here's a jsFiddle. Note that I've also implemented your toLowerCase() simply by adding 32 to the uppercase letters.
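
For reuse outside a page, the same loop can be wrapped in a function that returns the slug instead of writing it to the DOM. This is only a minimal sketch: the name `slugifyLoop` is hypothetical and does not appear in the original answer.

```javascript
// Hypothetical helper: the same character-code loop, returning a string.
function slugifyLoop(input) {
    var out = "";
    for (var i = 0; i < input.length; i++) {
        var c = input.charCodeAt(i);
        if ((c >= 48 && c <= 57) || (c >= 97 && c <= 122)) {
            out += input[i];                     // digit or lowercase letter: keep
        } else if (c >= 65 && c <= 90) {
            out += String.fromCharCode(c + 32);  // uppercase ASCII letter: lowercase it
        } else if (c === 32 || (c >= 9 && c <= 13)) {
            out += "-";                          // space, \t, \n, \v, \f, \r: dash
        }
        // anything else is dropped
    }
    return out;
}

console.log(slugifyLoop("Well Done!")); // well-done
```

One difference from the regex in the question: `\w` includes the underscore, so `[^\w-]` keeps `_`, while this loop drops it.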


Disclaimer

Personally of course, I prefer readable code, and therefore prefer regular expressions, or using some kind of a strtr function if one exists in JavaScript. This answer is purely to educate.

Andrew Cheong
  • Seeing as JavaScript has RE literals, it might not be *that* expensive. It's possible the RE is compiled along with the JS source, not every time it's invoked. – millimoose Aug 20 '13 at 14:13
  • Your function doesn't quite match the OP's. Specifically, he wants all whitespace replaced, but you're only replacing spaces. `\s` matches spaces, tabs, and line breaks. – Daniel Gimenez Aug 20 '13 at 16:17
  • @DanielGimenez - Thanks. I realized this, but figured the conversion was for creating slugs. I definitely should have mentioned it regardless. – Andrew Cheong Aug 20 '13 at 18:31
  • Turns out most of the characters considered whitespace by regular expressions are consecutively placed in ASCII; not sure whether JavaScript includes `\f` and `\v` as whitespace characters, but the answer has been updated to include all. – Andrew Cheong Aug 20 '13 at 18:51

Note: I thought I could come up with a faster solution with a single regex, but I couldn't. Below is my failed method (you can learn from failure), and the results of a performance test, and my conclusion.

Efficiency can be measured many ways. If you wanted to reduce the number of functions called, then you could use a single regex and a function to handle the replacement.

([A-Z])|(\s)|([^a-z\d])

REY

The first group is lowercased, the second is replaced with a -, and the third is replaced with an empty string. I originally used the + quantifier on groups 1 and 3, but given the expected nature of the text, removing it resulted in faster execution. (thanks acheong87)

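'Well Done!'.replace(/([A-Z])|(\s)|([^a-z\d])/g, function (match, upper, space) {
    // Group 1: uppercase ASCII letter, lowercased by adding 32 to its code.
    if (upper) return String.fromCharCode(upper.charCodeAt(0) + 32);
    // Group 2: whitespace becomes a dash.
    else if (space) return '-';
    // Group 3: any other character outside a-z and 0-9 is removed.
    return '';
});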

jsFiddle
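
The description above mentions `toLowerCase()` for the first group, while the snippet adds 32 to the character code; for ASCII capitals the two give the same result. Here is a sketch of the `toLowerCase()` form wrapped in a function. The name `slugifySingleRegex` is hypothetical and is used only for illustration.

```javascript
// Hypothetical wrapper around the single-regex approach.
function slugifySingleRegex(input) {
    return input.replace(/([A-Z])|(\s)|([^a-z\d])/g, function (match, upper, space) {
        if (upper) return upper.toLowerCase(); // group 1: capital letter
        if (space) return "-";                 // group 2: whitespace
        return "";                             // group 3: everything else that matched
    });
}

console.log(slugifySingleRegex("Well Done!")); // well-done
```

Lowercase letters and digits match none of the three alternatives, so `replace` leaves them untouched.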

Performance

My method was the worst performing:

Acheong87  fastest
Original   16% slower
Mine       53% slower

jsPerf
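
To re-run a comparison like this locally, a rough timing harness along the following lines could be used. It is a sketch, not the original jsPerf test: the labels `twoRegexes`, `charLoop`, and `singleRegex`, the helper `timeIt`, and the iteration count are all assumptions, and `twoRegexes` uses the `g` flag on `\s` so that every whitespace character is replaced. Absolute numbers will vary by engine and machine.

```javascript
// Three candidate implementations (labels are hypothetical, for this sketch only).
var candidates = {
    twoRegexes: function (s) {
        return s.toLowerCase().replace(/\s/g, "-").replace(/[^\w-]/g, "");
    },
    charLoop: function (s) {
        var out = "";
        for (var i = 0; i < s.length; i++) {
            var c = s.charCodeAt(i);
            if ((c >= 48 && c <= 57) || (c >= 97 && c <= 122)) out += s[i];
            else if (c >= 65 && c <= 90) out += String.fromCharCode(c + 32);
            else if (c === 32 || (c >= 9 && c <= 13)) out += "-";
        }
        return out;
    },
    singleRegex: function (s) {
        return s.replace(/([A-Z])|(\s)|([^a-z\d])/g, function (match, upper, space) {
            if (upper) return String.fromCharCode(upper.charCodeAt(0) + 32);
            if (space) return "-";
            return "";
        });
    }
};

// Runs fn(input) `iterations` times and returns the elapsed wall-clock milliseconds.
function timeIt(fn, input, iterations) {
    var start = Date.now();
    for (var i = 0; i < iterations; i++) {
        fn(input);
    }
    return Date.now() - start;
}

Object.keys(candidates).forEach(function (name) {
    var ms = timeIt(candidates[name], "Well Done!", 200000);
    console.log(name + ": " + ms + " ms");
});
```

A single short input like this mostly measures call overhead; longer and more varied strings would give a fairer picture.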

Conclusion

Your method is the most efficient in terms of code development time, and the performance penalty versus acheong87's method is offset by code maintainability, readability, and complexity reduction. I would use your version unless speed was of the utmost importance.

The more optional matches I added to the regular expression, the greater the performance penalty. I can't think of any advantages to my method except for the function reduction, but that is offset by the if statements and increase in complexity.

Daniel Gimenez
  • Nice; thanks for the benchmarks; I didn't know about this site. I wonder what the difference would be if you reversed the order of the alternation atoms (since spaces are the rarest, uppercase letters the second rarest, and lowercase letters the most common). Also, one of the reasons the above may be inefficient is that it requires backtracking. I see in _spirit_ you're trying to minimize _replacements_, but the `+` forces a "failed match" to occur before the next alternation atom is tested, whereas without the `+`, replacements are immediate. I'm not sure what the underlying code looks like – Andrew Cheong Aug 20 '13 at 18:37
  • in the JS engine, but I wonder if replacement time is more related to the number of characters rather than the number of invocations. I just edited your test to give these modifications a shot; hm, the performance benefits weren't as large as I thought they'd be. Indeed, a good example to learn from. (Now editing my answer to also include regex whitespace characters, _i.e._ `[ \f\n\r\t\v]`.) – Andrew Cheong Aug 20 '13 at 18:48
  • @acheong87, you're right in both regards. The nature of the sample text and practical usage makes `+` inefficient in terms of performance. Changing the order will also help if we expect capital letters to be more frequent. – Daniel Gimenez Aug 20 '13 at 18:57
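
One caveat when reordering the alternation discussed in these comments: `[^a-z\d]` also matches capitals and whitespace, so it cannot simply be moved to the front. The sketch below shows one possible reordering, in which the first class is narrowed so the three alternatives no longer overlap. The variable names are hypothetical, and no performance claim is made for either form.

```javascript
var input = "Well Done!";

// Naive reversal: [^a-z\d] comes first and swallows capitals and whitespace too.
var naive = input.replace(/([^a-z\d])|(\s)|([A-Z])/g, function (match, other, space, upper) {
    if (upper) return upper.toLowerCase();
    if (space) return "-";
    return "";
});

// Reordered with the first class narrowed so the alternatives are disjoint.
var reordered = input.replace(/([^A-Za-z\d\s])|(\s)|([A-Z])/g, function (match, other, space, upper) {
    if (upper) return upper.toLowerCase();
    if (space) return "-";
    return "";
});

console.log(naive);     // ellone
console.log(reordered); // well-done
```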