It seems that using a character class is faster than the alternation in an example like:
[abc]
vs (a|b|c)
I have heard about it being recommended and with a simple test using Time::HiRes
I verified it (~10 times slower).
Also using (?:a|b|c)
in case the capturing parenthesis makes a difference does not change the result.
But I can not understand why. I think it is because of backtracking but the way I see it at each position there are 3 character comparison so I am not sure how backtracking hits in affecting the alternation. Is it a result of the implementation's nature of alternation?
-
4Think of the logic of the engine - in the first case you tell the engine that you are looking for one of 3 characters. The engine can optimise. In the second case the engine must check the first one, fail, backtrack, check the second one, fail, backtrack and finally check the last option. – Boris the Spider Mar 02 '14 at 19:49
-
1I've run some tests and extracted some debug info with `pcretest`. Here's [the results](http://pastebin.com/rkCr3GAr). Quite funny to see that PCRE somehow improves `[abc]` to `[a-c]` and that there seems to have a lot more "steps" when using a capturing group. – HamZa Mar 02 '14 at 19:56
-
possible duplicate of http://stackoverflow.com/questions/4724588/using-alternation-or-character-class-for-single-character-matching – devnull Mar 02 '14 at 19:59
-
@BoristheSpider:I am not sure about this difference.If we are looking for 1 of 3 characters we still need to test 3 times.One per character right? – Jim Mar 02 '14 at 20:02
-
Why the linux and perl tags? Are you only interested in the way perl parses regexes? I'm not aware of any regex parsing that's part of the Linux kernel. – ghoti Mar 02 '14 at 20:03
-
Well, this isn't really accurate, more illustrative, but... Imagine in the first case you can evaluate a single expression `return x == 'A' || x == 'b' || x =='c'`, in the second case the engine needs to restart the evaluation at a much higher level. – Boris the Spider Mar 02 '14 at 20:11
-
@ghoti:Because I tried the timing in perl. I removed linux tag – Jim Mar 02 '14 at 20:16
2 Answers
This is because the "OR" construct |
backtracks between the alternation: If the first alternation is not matched, the engine has to return before the pointer location moved during the match of the alternation, to continue matching the next alternation; Whereas the character class can advance sequentially. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
Pattern: [rf]at
Match string: carat
But to be short, the fact that pcre engine optimizes this (single literal characters -> character class) away is already a decent hint that alternations are inefficient.

- 10,902
- 13
- 62
- 72
-
1
-
4@AhmedFasih Hi! The pictures are screenshots of the Regex Debugger feature from [regex101.com](http://regex101.com). You can get it to work by entering your regular expression in the bar for it, the test string in the test string area, and select "Regex Debugger" in the sidebar to the left. To get the dark theme like that of the screenshots, select the wrench icon in the top right of the web interface and choose "Dark Theme". – Unihedron Jun 10 '15 at 15:35
Because a character class like [abc]
is irreducable and can be optimised, whereas an alternation like (?:a|b|c)
may also be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t)
.
The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.
There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".

- 126,100
- 9
- 70
- 144
-
`[abc] is irreducable` Don't you mean reducable? I.e. can be reduced to something simpler? – Jim Mar 02 '14 at 21:01
-
@Jim: No, I mean that a character class is just a set of code points and cannot be described any more concisely. – Borodin Mar 02 '14 at 21:02
-
@Jim: If you still don't understand then please explore your question further. – Borodin Mar 02 '14 at 21:19
-
4@Jim FYI: If all alternatives in an alternation are constant strings (i.e. not something exotic like lookarounds), it will get optimized to a *trie*. So it's not true that an alternation disables any optimizations, it's just that a charclass is still much more efficient than the more general trie, and that nobody bothered to further complicate the regex engine by adding a single-letter trie → charclass optimization. To see the opcodes to which a regex was compiled, do something like `perl -Mre=debug -E'qr/a|b|c/'` – amon Mar 02 '14 at 21:22
-
1@amon: I resent you use of *"bothered"*. I doubt that a trie for single-character strings would be any faster over all, given that most character classes look like `[a-zA-Z]` or `[^"]`. – Borodin Mar 02 '14 at 21:26
-
+1. Just what I was trying to say but put in a way that makes sense! – Boris the Spider Mar 02 '14 at 21:50