RegEx word performance: \w vs [a-zA-Z0-9_]

Question

I'd like to know the list of chars that \w passes, is it just [a-zA-Z0-9_] or are there more chars that it might cover?

I'm asking this question because, based on this, \d is different from [0-9] and is less efficient.

\w vs [a-zA-Z0-9_]: which one is likely to be faster at large scale?

If you want to know the performance difference between 2 pieces of code, benchmark them — , Apr 16 '19 at 01:43
Start a timer, loop 10k times, stop the timer, and compare; run each test for X number of cycles. The examples at https://www.php.net/manual/en/function.microtime.php show how to set up the timing — , Apr 16 '19 at 01:46
Take BenchmarkDotNet library https://github.com/dotnet/BenchmarkDotNet, write tests and compare https://aakinshin.net/posts/stephen-toub-benchmarks-part1/ — Backs, Apr 16 '19 at 01:52
it uses microtime, as i suggested above, so dont see your objection to it — , Apr 16 '19 at 02:10
If your question is about how to test the performance of X, then ask that. — MineR, Apr 16 '19 at 05:49

ikegami · Accepted Answer · 2019-04-16T09:01:05.227

[ This answer is Perl-specific. The information within may not apply to PCRE or the engine used by the other languages tagged. ]

/\w/aa (the actual equivalent of /[a-zA-Z0-9_]/) is usually faster, but not always. That said, the difference is so small (typically less than 1 nanosecond per check) that it shouldn't be a concern. To put it into context, it takes far, far longer to call a sub or start the regex engine.

What follows covers this in detail.

First of all, \w isn't the same as [a-zA-Z0-9_] by default. \w matches every alphabetic, numeric, mark and connector punctuation Unicode Code Point. There are 119,821 of these!^[1] Comparing the speed of two pieces of code that aren't equivalent makes no sense.

However, using \w with /aa ensures that \w only matches [a-zA-Z0-9_]. So that's what we're going to be using for our benchmarks. (Actually, we'll use both.)
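To make the distinction concrete, here is a minimal sketch (not part of the benchmark) that checks one non-ASCII letter against the three patterns. The choice of U+0100 and the variable names are only for illustration.

```perl
use strict;
use warnings;

# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) is a letter, but not an ASCII one.
my $char = "\N{U+0100}";

# Record 1 for a match and 0 for a failure under each pattern.
my $default_match = $char =~ /\w/            ? 1 : 0;  # Unicode rules apply to this string
my $ascii_match   = $char =~ /\w/aa          ? 1 : 0;  # /aa restricts \w to ASCII
my $class_match   = $char =~ /[a-zA-Z0-9_]/  ? 1 : 0;  # explicit ASCII class

print "default \\w: $default_match, \\w with /aa: $ascii_match, explicit class: $class_match\n";
```

Only the plain \w should report a match here, which is why the benchmarks compare the explicit class against \w under /aa.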

(Note that each test performs 10 million checks, so a rate of 10.0/s actually means 10.0 million checks per second.)

ASCII-only positive match
               Rate [a-zA-Z0-9_]      (?u:\w)     (?aa:\w)
[a-zA-Z0-9_] 39.1/s           --         -26%         -36%
(?u:\w)      52.9/s          35%           --         -13%
(?aa:\w)     60.9/s          56%          15%           --

When finding a match in ASCII characters, ASCII-only \w and Unicode \w both beat the explicit class.

/\w/aa is ( 1/39.1 - 1/60.9 ) / 10,000,000 = 0.000,000,000,916 s faster on my machine
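The per-check figure can be reproduced from the two rates. The following is a small sketch of that arithmetic; the helper name per_check_difference is made up for illustration and is not part of Benchmark.

```perl
use strict;
use warnings;

# Each benchmark iteration performs 10 million checks, so 1/rate is the
# time taken by 10 million checks.
my $checks_per_iteration = 10_000_000;

# Hypothetical helper: time saved per check by the faster of two rates
# (rates are iterations per second, as printed by cmpthese).
sub per_check_difference {
    my ($slow_rate, $fast_rate) = @_;
    return (1 / $slow_rate - 1 / $fast_rate) / $checks_per_iteration;
}

# Rates from the "ASCII-only positive match" table above.
my $diff = per_check_difference(39.1, 60.9);
printf "%.3g s per check\n", $diff;
```

The same helper can be fed the rates from the other tables to check their per-check figures.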

ASCII-only negative match
               Rate      (?u:\w)     (?aa:\w) [a-zA-Z0-9_]
(?u:\w)      27.2/s           --          -0%         -12%
(?aa:\w)     27.2/s           0%           --         -12%
[a-zA-Z0-9_] 31.1/s          14%          14%           --

When failing to find a match in ASCII characters, the explicit class beats ASCII-only \w.

/[a-zA-Z0-9_]/ is ( 1/27.2 - 1/31.1 ) / 10,000,000 = 0.000,000,000,461 s faster on my machine

Non-ASCII positive match
               Rate      (?u:\w) [a-zA-Z0-9_]     (?aa:\w)
(?u:\w)      2.97/s           --        -100%        -100%
[a-zA-Z0-9_] 3349/s      112641%           --          -9%
(?aa:\w)     3664/s      123268%           9%           --

Whoa. This test appears to be running into some optimization. That said, running the test multiple times yields extremely consistent results. (Same goes for the other tests.)

When finding a match in non-ASCII characters, ASCII-only \w beats the explicit class.

/\w/aa is ( 1/3349 - 1/3664 ) / 10,000,000 = 0.000,000,000,002,57 s faster on my machine

Non-ASCII negative match
               Rate      (?u:\w) [a-zA-Z0-9_]     (?aa:\w)
(?u:\w)      2.66/s           --          -9%         -71%
[a-zA-Z0-9_] 2.91/s          10%           --         -68%
(?aa:\w)     9.09/s         242%         212%           --

When failing to find a match in non-ASCII characters, ASCII-only \w beats the explicit class.

/\w/aa is ( 1/2.91 - 1/9.09 ) / 10,000,000 = 0.000,000,023,4 s faster on my machine

Conclusions

I'm surprised there's any difference between /\w/aa and /[a-zA-Z0-9_]/.
In some situations, /\w/aa is faster; in others, /[a-zA-Z0-9_]/.
The difference between /\w/aa and /[a-zA-Z0-9_]/ is very small (less than 1 nanosecond per check in three of the four tests).
The difference is so small that you shouldn't be concerned about it.
Even the difference between /\w/aa and /\w/u is quite small despite the latter matching over 3 orders of magnitude more characters than the former (119,821 versus 63).

use strict;
use warnings;
use feature qw( say );

use Benchmark qw( cmpthese );

my %pos_tests = (
   '(?u:\\w)'     => '/^\\w*\\z/u',
   '(?aa:\\w)'    => '/^\\w*\\z/aa',
   '[a-zA-Z0-9_]' => '/^[a-zA-Z0-9_]*\\z/',
);

my %neg_tests = (
   '(?u:\\w)'     => '/\\w/u',
   '(?aa:\\w)'    => '/\\w/aa',
   '[a-zA-Z0-9_]' => '/[a-zA-Z0-9_]/',
);

$_ = sprintf( 'use strict; use warnings; our $s; for (1..1000) { $s =~ %s }', $_)
   for
      values(%pos_tests),
      values(%neg_tests);

local our $s;

say "ASCII-only positive match";
$s = "J" x 10_000;
cmpthese(-3, \%pos_tests);

say "";

say "ASCII-only negative match";
$s = "!" x 10_000;
cmpthese(-3, \%neg_tests);

say "";

say "Non-ASCII positive match";
$s = "\N{U+0100}" x 10_000;
cmpthese(-3, \%pos_tests);

say "";

say "Non-ASCII negative match";
$s = "\N{U+2660}" x 10_000;
cmpthese(-3, \%neg_tests);

[1] Unicode version 11.

I don't think it will make a difference, but `/a` is sufficient to make `\w` exactly equivalent to `[a-zA-Z0-9_]` as long as you're not using `/i`. — Grinnz, Apr 16 '19 at 15:58
@Grinnz, I kept it simple. I said that "using `\w` with `/aa` ensures that `\w` only matches `[a-zA-Z0-9_]`", a claim that can't be made for `/a` without going off-topic. — ikegami, Apr 16 '19 at 16:14
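
For readers wondering why `/i` matters in the exchange above: under `/i`, a single `/a` still allows an ASCII letter to match a non-ASCII character that case-folds to it, while `/aa` forbids this (see perlre). The sketch below shows that documented behavior with a literal `k`. It is an illustration only, and connecting it to the `\w` caveat is this example's assumption, not something either commenter spelled out. It needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# U+212A KELVIN SIGN case-folds to the ASCII letter "k".
my $kelvin = "\N{U+212A}";

# Record 1 for a match and 0 for a failure under each modifier set.
my $unicode_i = $kelvin =~ /k/i   ? 1 : 0;  # Unicode case folding
my $a_i       = $kelvin =~ /k/ai  ? 1 : 0;  # /a alone still allows the fold
my $aa_i      = $kelvin =~ /k/aai ? 1 : 0;  # /aa forbids ASCII/non-ASCII folds

print "/i: $unicode_i, /ai: $a_i, /aai: $aa_i\n";
```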

zdim · Answer 2 · 2019-04-28T05:15:52.597

This answer is based on Perl but all tagged tools should be very similar in the following.

The \w character class (for a "word" character) follows Unicode specs for character properties of a "word." This includes so much stuff and complexity that it is a challenge to specify the categories of included properties. See "Word characters" in perlrecharclass, and this post for instance. See perlunicode and perluniprops for background.

In short, it's way beyond the 63 ASCII chars, unless the /a (or /aa) modifier or locales are used.
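
The figure of 63 can be checked directly. This is a quick sketch that counts the characters in the ASCII range accepted by \w under /a; the variable names are just for illustration.

```perl
use strict;
use warnings;

# Collect the code points in the ASCII range (0..127) that \w accepts under /a.
my @ascii_word_chars = grep { /\w/a } map { chr($_) } 0 .. 127;
my $count = @ascii_word_chars;

# 26 lowercase + 26 uppercase + 10 digits + underscore
print "$count ASCII characters match \\w\n";
```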

However, the question is specifically about performance. At this point different tools should be expected to diverge in behavior, and possibly a lot, since this depends on regex implementation. The rest of this post is specific for Perl.

One may expect that a smaller set may be faster to check for, or one may expect that constructs like \w come with optimizations. Instead of guessing let us measure. The following is a crude benchmark aiming for reasonable findings, leaving out a few nuances.

use warnings;
use strict;
use feature 'say';

use List::Util qw(shuffle);
use Benchmark qw(cmpthese);

my $run_for = shift // 3;  # seconds to run benchmark for

my $str = join '', (shuffle 'a'..'z', 'A'..'Z', 0..9, '_') x 100;

sub word_class {
    my $str = shift;
    my @m_1 = $str =~ /\w/g;
    return \@m_1;
}

sub char_class {
    my $str = shift;
    my @m_2 = $str =~ /[a-zA-Z0-9_]/g;
    return \@m_2;
}


cmpthese(-$run_for, {
    word => sub { my $res = word_class ($str) },
    char => sub { my $res = char_class ($str) },
});

A string is assembled using [a-zA-Z0-9_] which are shuffled and then repeated 100 times. That whole string is matched, character by character under /g, by \w and by [a-zA-Z0-9_]. So it's a single regex in each case and these are benchmarked.

The result

      Rate char word
char 583/s   --  -1%
word 587/s   1%   --

The numbers above go up to 2% either way in various runs in my tests. So no difference.

Note: I have tried with non-ascii characters added to the test string, with no discernable difference.

Note: The regex with /g accumulates matches (6300) char after char, but in a single engine run. The other option is to check for a single match repeatedly. These are not the same but regardless both will expose a difference in performance between \w and [a-zA-Z0-9_] if it is considerable.
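
As a sanity check on the match count, here is a short sketch that counts the matches without storing them. It rebuilds the same test string as the benchmark; the counting idiom (assigning to an empty list in scalar context) is a substitute for the arrays used above, not what the benchmark itself does.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Same construction as the benchmark string: the 63 ASCII word characters,
# shuffled, then repeated 100 times.
my $str = join '', (shuffle 'a'..'z', 'A'..'Z', 0..9, '_') x 100;

# Assigning a list to () in scalar context yields the number of elements,
# here the number of matches found under /g.
my $word_count  = () = $str =~ /\w/g;
my $class_count = () = $str =~ /[a-zA-Z0-9_]/g;

print "word: $word_count, class: $class_count\n";
```

Both patterns should report the same count on this string, since every character in it is an ASCII word character.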

Please time it for yourself, with string and patterns better suited for your circumstances.

The above benchmark was meant to be a basic, rough measure. However, notably missing are negative (failing) matches, whereby the engine is expected to go through all possibilities for tested patterns.

I test for that by invoking the benchmarked routines above on the target string changed to

$str = join '', qw(! / \ { } ^ % @) x 1_000;

which will fail to match under both \w and [a-zA-Z0-9_]. The result

        Rate char word
char 72820/s   -- -19%
word 89863/s  23%   --

This is a surprise to me, to say the least. The \w set is so much larger (see ikegami's answer) that this must imply that there are heavy (or "magical") optimizations going on.

This reinforces my overall conclusion: performance of these is close enough in general, so simply use whichever is more suitable coding-wise, or time it in your specific use case.

edited Apr 28 '19 at 05:15

answered Apr 16 '19 at 06:34

zdim


This is a rather poor benchmark. It only tests one kind of data, it only checks positive matches, and it doesn't really check the differences between the two things being benchmarked because they're lost in the noise of the time needed to call a sub and start the regex engine. – ikegami Apr 16 '19 at 08:45
Re "*[The difference is] lost in the noise of the time needed to call a sub and start the regex engine*", I suppose this is meaningful. It speaks to how little does optimizing this matter. – ikegami Apr 16 '19 at 10:49
@ikegami Re "_rather poor_" -- I absolutely disagree. (1) It wasn't meant to be exhaustive but rather a reasonable rough take; it's fully documented -- I state _exactly_ what/how is tested. The finding ("it doesn't matter") _is_ reasonable. (2) "_only tests one kind of data_" -- I tried w/ non-ascii data and it made no difference; elaborating on that would've been against its "_rough take_" nature (3) "_only checks positive matches_" -- it checks through a string of nearly 10K mixed up ascii; there's plenty of misses. Again, it's a rough take. (4) "_lost in the noise_" -- so it doesn't matter – zdim Apr 16 '19 at 17:20
@ikegami For the record: I appreciate your detailed answer (great as always), and _absolutely appreciate_ your feedback/criticism on this one. I just disagree with the "poor" qualification; I think it's useful as it stands. – zdim Apr 16 '19 at 17:41
@ikegami One thing I don't understand though, with "_lost in the noise of the time needed to call a sub and start the regex engine_" (1) it's got to run them to benchmark (?) ... do you mean that benchmark with expression (rather than sub) is that much quicker? (2) It runs through all chars in the string with `/g` -- that doesn't restart the engine? And isn't that a reasonable way to test more than one match? – zdim Apr 16 '19 at 17:45
Re "*It wasn't meant to be exhaustive*", For all you know, it's a million times slower at failing to match. You gotta test both. This isn't about being exhaustive; it's about you having no idea what kind of data the OP has, and not documenting your massive assumptions. By your logic (only testing matches), Perl is an extremely slow regex engine. While in fact, it performs quite well because many of the optimizations speed up *failed* matches. That's why it's important to test. – ikegami Apr 16 '19 at 18:14
Re "*- it checks through a string of nearly 10K mixed up ascii;*", uh, not it doesn't. Every single character in the string matches. – ikegami Apr 16 '19 at 18:15
By your logic (only testing matches), Perl is an extremely slow regex engine. While in fact, it performs quite well because many of the optimizations speed up *failed* matches – ikegami Apr 16 '19 at 18:17
Re "*I just disagree with the "poor" qualification*", Well, your code doesn't even show that one is actually 10% faster than the other... Benchmarks are trivially easy to do wrong. – ikegami Apr 16 '19 at 18:19
@ikegami "_code doesn't even show that one is actually 10% faster than the other_" --- um, well ... it isn't ? Both `\w` and `[...]` match through a long string with same rates; why do you say that one is "actually 10% faster"? [ I did complete the negative test, and they aren't the same ... the `\w` is 23% faster ?? Checking...] – zdim Apr 16 '19 at 18:28
Like I said, your code compares calling a sub + starting the regex engine + matching one char using one method vs calling a sub + starting the regex engine + matching one char using the other method. There's so much overhead that you bury the difference in noise. – ikegami Apr 16 '19 at 18:31
@ikegami "_uh, not it doesn't. Every single character in the string matches._" (a few comments above) --- well, it matches but it has to _find_ the match for a character in hand, in whatever it considers for that pattern (so it'd be many more possibilities for `\w` etc). It's a way to check. (Sure, a failing match is a much surer way.) – zdim Apr 16 '19 at 18:31
@ikegami "_Like I said,..._" -- OK, but why do you say "_matching one char_" --- it goes through the whole string with `/g`, in that one sub call, and if I'm not mistaken it doesn't restart the engine for each match either. – zdim Apr 16 '19 at 18:34
@ikegami "_Benchmarks are trivially easy to do wrong_" -- yeah, been dealing with them for decades now ... that's why I keep discussing this, since I think that this one is reasonable (as in not flawed/wrong/...) and if not I'd like to know why. Speaking of which -- a negative (failing) test being faster for `\w` just tells me that there's too much "magic" (optimizations) going on behind the scenes. – zdim Apr 16 '19 at 18:38

score 0 · Answer 3 · 2019-04-16T05:32:37.760

As far as I can tell, \w depends on the locale environment settings, such as:
LANG=
LC_CTYPE=
LC_ALL=
If that is right, then \w is not limited to [A-Za-z_], since many other UCS characters exist.
With LANG=en_US, IMHO it is just [A-Za-z_]; see "Explain the effects of export LANG, LC_CTYPE, LC_ALL".

Whether \d is distinct from [0-9] depends on the regex engine, of course.
sed's \d does not mean [0-9], even with the -E option; only more capable regex engines support that. In GNU sed, the digit class is instead written [[:digit:]].
IMHO, every predefined shorthand class is faster than its equivalent explicit [] class:
\w and \d are faster than [A-Za-z_] and [0-9] respectively,
\W is faster than [^A-Za-z_], and so on.

Re "*`\w` as far as I assume, should be depended on locale environment setup*", It is not (unless `/l` is used) in Perl — ikegami, Apr 16 '19 at 07:16
