Performance optimizing a regex (regular expression) that matches a specific URL schema

Question

I am working on optimizing a regex (regular expression) that will match the following URL schema:

protocol://anything1/folder/index.html[?param=anything2]

where items in brackets are optional, and anything1 and anything2 can each be sequences of any characters. Everything else is static literals.

If it matters, anything1 will range in length from 36 to 48 characters, and anything2 will range in length from 5 to 40 characters. The regex does not need to validate any of this.

Importantly, both anything1 and anything2 can include forward slashes.

There are no issues if the regex requires anything1 or anything2 to be at least 1 character, as it always will be, but as performance is most important, I'm fine if it matches 0 or 1+ characters for anything1 and/or anything2.

The regex is only used for matching, and not for parsing. Captured groups are not used elsewhere in the code.

Most importantly, I would like the regex to be as efficient (in regards to speed) as possible.

So far, I have:

^protocol://.+/folder/index\.html($|\?param=.+)

The regex must match the entire string, and not just part of it.

The regex engine is the one used internally by Firefox for its CSS engine (which I believe is the same as their JavaScript regex engine).

My regex works as expected, and I'm asking if it can be further optimized for performance.

Depending on `anything1` expected length in _your_ specific cases, you might consider making the corresponding `.+` part lazy (with `?` modifier). But as for 'one-size-fits-all' scenario it seems to be a decent choice. — raina77ow, Oct 10 '20 at 15:47
@raina77ow Thank you for your comment. `anything1` will be a standard UUID (32 hex digits, plus 4 dashes), although it does not require any validation by the regex (validated elsewhere). — RockPaperLz- Mask it or Casket, Oct 10 '20 at 15:49
@Thefourthbird Thanks for mentioning that. Yes, it's fine if `anything1` matches a dot or a space. `anything1` and `anything2` are validated elsewhere before hitting the regex. — RockPaperLz- Mask it or Casket, Oct 10 '20 at 15:50
If you want to have your regex optimised, could you publish it? For I am not sure that `^protocol://.+/folder/index\.html($|\?param=.+)` is literally the regex you work with. Does it really have `protocol` in it? Also, what do you want to achieve with your regex? Only matching, or parsing? — Alexander Mashin, Oct 10 '20 at 15:53
If `anything1` can be a uuid which does not have to be validated you might use a negated character class. If you don't need the capturing group at the end you can make it non capturing. But I don't know if firefox optimizes the patterns when executed, so you would have to performance test it. `^protocol://[^\/\s]+/folder/index\.html(?:\?param=.+)?$` — The fourth bird, Oct 10 '20 at 15:58
@AlexanderMashin Thank you for your interest. The regex published above is literally the regex I wrote, just with the literals changed to make it much easier to read. Matching is the only function of this regex (no parsing). — RockPaperLz- Mask it or Casket, Oct 10 '20 at 15:59
@Thefourthbird After reading your comment above, I think it will make a good answer. I do not need the contents of any captured groups. Does making a group non-capturing improve its performance? — RockPaperLz- Mask it or Casket, Oct 10 '20 at 16:01
@Thefourthbird Just FYI, since validation is handled elsewhere, your proposed regex can be simplified by removing `\s`. — RockPaperLz- Mask it or Casket, Oct 10 '20 at 16:03
There are some post about capturing groups and performance https://stackoverflow.com/questions/33243292/capturing-group-vs-non-capturing-group or https://stackoverflow.com/questions/41444807/why-is-regex-search-slower-with-capturing-groups-in-python Could not yet find if it also applies to Firefox though. Maybe somebody else knows.. — The fourth bird, Oct 10 '20 at 16:07
Actually, upon reading it again, I don't think your proposal would quite meet the specified schema. Technically, the spec allows `anything1` to include a forward-slash. I'm going to take a look at some of the code to see if that ever happens, but I think it does. For example: `protocol://UUID/folderxyz/folder/index.html` — RockPaperLz- Mask it or Casket, Oct 10 '20 at 16:07
@Thefourthbird I have updated the question to answer your questions and provide additional details, such as the fact that `anything1` can include forward slashes. — RockPaperLz- Mask it or Casket, Oct 10 '20 at 16:20
Then you might opt to match the range from 36 to 48 chars `^protocol://.{36,48}/folder/index\.html(?:\?param=.+)?$` — The fourth bird, Oct 10 '20 at 16:23
@Thefourthbird Interesting idea! Would the `.{36,48}` be faster than `.+` in this situation? It requires another 2 variables for the regex engine to check, but also allows a potentially earlier match fail point. Also note that `protocol` is unique, and will not match 99+% of strings being passed to the regex. — RockPaperLz- Mask it or Casket, Oct 10 '20 at 16:27
@Thefourthbird I see 3 possible improvements you made to my regex: (1) adding a length range for `anything1`, (2) preventing capturing a group, (3) changing how `anything2` is processed. If you write an answer explaining (in as much or as little detail as you prefer!) why those 3 changes each will likely improve performance, it will be an incredible (and most appreciated!) answer. — RockPaperLz- Mask it or Casket, Oct 10 '20 at 16:33
Do you mean the `file:` when you say protocol, or something else ? _http:_ and domain ? — , Oct 10 '20 at 19:37
@Maxt8r For this question, I literally mean the static string `protocol`. — RockPaperLz- Mask it or Casket, Oct 12 '20 at 14:04

score 1 · Answer 1 · answered Oct 11 '20 at 01:06

Instead of using ^protocol://.+/folder/index\.html($|\?param=.+), you could make the pattern a bit more specific so there is less work for the engine:

^protocol://.{36,48}/folder/index\.html(?:\?param=.+)?$

A few recommendations could be

Change the first .+ to the known limit of 36-48 characters according to the comments.

The .+ will first match until the end of the string, then it starts backtracking to match / and the rest of the pattern. Knowing the range upfront might cut down on the backtracking.
If you don't need the last group, you can turn it into an optional non capturing group so there is no group data to store. See for example 2 pages about why using capturing groups is slower than using non capturing groups.

See capturing group VS non-capturing group and Why is regex search slower with capturing groups in Python?.
As an indication, you can see for example the difference in steps with less backtracking (with PCRE selected on regex101, the Javascript option does not yield any steps)

Original pattern and updated pattern

I don't know if Firefox does optimization beforehand, so if you really want to be sure if there is a performance win, you would have to benchmark it in Firefox.

Note that the . can match any character except a newline (and will also match a space), which does not take any data structure into account.

Thank you for your great answer. I'm reading it multiple times, as well as all the links you kindly provided. Thank you again! — RockPaperLz- Mask it or Casket, Oct 12 '20 at 14:08

Alexander Mashin · Answer 2 · 2020-10-10T16:45:45.077

0

Try something along this lines:

^protocol:\/\/[a-z\d-]+([^?]+)?\/folder\/index\.html(\?[^=&]+=[^&]+)?$

Note the usage of early-failing character classes to prevent backtracking (you may want to add more characters to those [^]'s).

https://regex101.com/r/h3x5vH/2.

edited Oct 10 '20 at 16:45

answered Oct 10 '20 at 16:05

Alexander Mashin

3,892
1
9
15

After looking at your proposal, I'm guessing that you think this will likely be faster due to the early-failing negative character classes. Am I correct? Note that `anything1` can include forward slashes, so some things will have to change. BTW, the protocol is a custom protocol that will not match 99+% of strings passed to the regex. Given that fact, will the regex I already wrote be just as fast as the one you proposed? – RockPaperLz- Mask it or Casket Oct 10 '20 at 16:23
You can easily replace `https?` with `protocol`. If `anything1` can contain slashes, I see no reason to separate it from `folder` and `index` unless those are constants and are to be taken literally. If so, try something like `^protocol:\/\/[a-z\d-]+([^?]+)?\/folder\/index\.html(\?[^=&]+=[^&]+)?$`. I assume, there is a traditional domain name in your URL. – Alexander Mashin Oct 10 '20 at 16:34
Thanks. Correct, `folder` and `index` are constants and are to be taken literally. What I'm wondering is, since `^protocol` will almost always (99+%) generate a match fail, does anything after it really matter much? Or even with the caret before `protocol`, can the regex after `^protocol` still make a substantial performance difference? – RockPaperLz- Mask it or Casket Oct 10 '20 at 16:40
1

I think that a caret before `protocol` is important. I think that without it, re regex engine will consume the whole line before giving up. And if most of your strings do not begin with `^protocol` and you are OK with few false positives, you can drop the rest altogether, although it doesn't cost anything after the regex is compiled. – Alexander Mashin Oct 10 '20 at 16:45
Thank you for your helpful feedback! – RockPaperLz- Mask it or Casket Oct 12 '20 at 14:07

Performance optimizing a regex (regular expression) that matches a specific URL schema

2 Answers2