How to remove the slowness of this regex?

Question

I have the following Regex:

(\d+\s+[-]\s+.*?(?=\s+-)|\d+\s+[-].*)

The regex is applied to this text:

"Option 01 - Random phrase - Top Menu",
"Option 02 - Another Random Phrase - Su Menu",
"Option 03 - More 01 Phrase - Menu",
"Option 04 - More Phrase -",
"Option 05 - Simple Phrase"

The result should look like this:

01 - Random phrase ",
02 - Another Random Phrase ",
03 - More 01 Phrase ",
04 - More Phrase ",
05 - Simple Phrase ",

The purpose of this regex is to match the number followed by a dash at the start, and to stop before the last dash. For example:

dfhdjfhdjf01 - text text - dkfdçsjf

When there is no dash at the end, it matches everything up to the end:

dfhdjfhdjf01 - text text dkfdçsjf

However, debugging this regex on regex101.com reports 63 to 122 steps. That is, this regex is very slow.

Before you criticize the question: I have read all the regex documentation, and I want you to know that I am asking about something specific, a problem that needs to be solved. After all, isn't that what this site is for?

So, how can I fix the slowness of this regex?

My main criticism of the question would be that you don't specify what exactly it is you're trying to do or what the range of input can be before asking for a way to simplify the pattern. — CAustin, Mar 21 '19 at 21:42
You can get rid of the alternation and repeated pattern at the end and just add `|$` in the lookahead. Would something like [`\d+\s+-\s+.*?(?=\s+-|$)`](https://regex101.com/r/DAIM3r/1) be what you're looking for? — 41686d6564 stands w. Palestine, Mar 21 '19 at 21:43
My friend, I'm trying to reduce the number of steps this regex takes. This regex is heavy, I want it to be lighter. — sYsTeM, Mar 21 '19 at 21:44
Why do you need to simplify the regex? Why is 122 steps too many? Is this actually causing you problems? — Ben, Mar 21 '19 at 21:45
There's no way for us to know how it can be "lighter" unless you explain the rules of what you're trying to match. Without context, the only simplification can be removing strictly redundant patterns, like replacing `[-]` with just `-`. Anything beyond that could be removing functionality based on assumption. — CAustin, Mar 21 '19 at 21:47
@Ben Yeah, the way I built this regex is bothering me. — sYsTeM, Mar 21 '19 at 21:48
If your regex bothers you, give Expresso a chance: http://www.ultrapico.com/expresso.htm It's a free desktop tool that explains your constructs and assists in designing and validating a solution. My favorite. — Darek, Mar 21 '19 at 21:55
As others have pointed out, a regex doesn't make sense if it doesn't capture any reasonable submatches from the text. So we're somewhat lost in speculation here if you don't tell us what kind of parsing this regex is supposed to perform. If you just want a fast match, try `.*`, which will certainly match in very few steps. But that's probably not what you are looking for... — SBS, Mar 22 '19 at 07:25
@SBS The function of this regex is to capture this text template before the last dash: (**01 - text text** -). If there is no dash, it captures everything until the end, as long as the pattern at the start is kept. I updated my question. — sYsTeM, Mar 22 '19 at 10:07
@sYsTeM OK, thanks for the clarification! In this case, what do you think about this one: `\d+[^-]*-[^-"]*`? It runs your test on regex101.com in 103 steps. I've updated my answer below. — SBS, Mar 22 '19 at 10:51

Wiktor Stribiżew · Accepted Answer · 2019-03-21T22:01:10.737

You should not worry too much about the number of steps you see at regex101.com, because the C# regex library is very reliable. If you test a simple regex like `(?s)a.*?b` at regex101 with a very long string, it will report catastrophic backtracking, while the same regex will work just fine in C# code.

There is a way to improve your pattern, since it has some redundancy: see the repeating `\d+\s+[-]` pattern.

All you need is

\d+\s+-.*?(?=\s+-|$)

See the regex demo at regex101 and RegexStorm.
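A minimal sketch to check the simplified pattern against the pattern from the question. The thread is about the C# regex library; Python's `re` module is used here only as an assumption for illustration, and it should treat these particular constructs the same way on this input. The sample strings are the question's options with the surrounding quotes and commas stripped (also an assumption), and `SAMPLE_LINES`, `ORIGINAL`, `SIMPLIFIED` and `first_match` are hypothetical names.

```python
import re

# The question's sample options, assumed here to be handled one string at a
# time, without the surrounding quotes and commas.
SAMPLE_LINES = [
    "Option 01 - Random phrase - Top Menu",
    "Option 02 - Another Random Phrase - Su Menu",
    "Option 03 - More 01 Phrase - Menu",
    "Option 04 - More Phrase -",
    "Option 05 - Simple Phrase",
]

# Pattern from the question: two alternatives that repeat the same prefix.
ORIGINAL = re.compile(r"(\d+\s+[-]\s+.*?(?=\s+-)|\d+\s+[-].*)")
# Simplified pattern from this answer: one branch, with `$` in the lookahead.
SIMPLIFIED = re.compile(r"\d+\s+-.*?(?=\s+-|$)")


def first_match(pattern, line):
    """Return the text of the first match in `line`, or None if there is none."""
    found = pattern.search(line)
    return found.group(0) if found else None


original_results = [first_match(ORIGINAL, line) for line in SAMPLE_LINES]
simplified_results = [first_match(SIMPLIFIED, line) for line in SAMPLE_LINES]

for line, result in zip(SAMPLE_LINES, simplified_results):
    print(f"{line!r} -> {result!r}")

# On these five lines both patterns pick out the same substrings.
print(original_results == simplified_results)
```

On these inputs the simplified pattern returns `01 - Random phrase` through `05 - Simple Phrase`, the same substrings as the original. That is a check on five lines only, not a proof that the two patterns agree on every input.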

If the `.*?(?=\s+-)` part should only match when there is whitespace after the `-`, use

\d+\s+-(?:\s.*?(?=\s+-)|.+)

See another demo 1 (fewer steps :)) / demo 2.

If you want to optimize it further, you might want to study the unroll-the-loop principle that leads to

\d+\s+-(?:\s+\S*(?:\s(?!\s*-)\S*)*|.+)

See this regex demo (the fewest steps).

Here, `\S*(?:\s(?!\s*-)\S*)*` is (almost) equivalent to `.*?(?=\s+-|$)`, but it is more efficient: the chunks up to a whitespace are matched in "batches", and the checks for a hyphen are made only when a whitespace is encountered.

Details

`\d+` - 1+ digits
`\s+` - 1+ whitespaces
`-` - a hyphen
`.*?(?=\s+-|$)` - any 0+ chars, as few as possible, up to the first occurrence of 1+ whitespaces and `-`, or up to the end of the string.
`(?:\s.*?(?=\s+-)|.+)` - a non-capturing group:
- `\s.*?(?=\s+-)` - a whitespace, then 0+ chars, as few as possible, up to 1+ whitespaces and `-`
- `|` - or
- `.+` - the rest of the line.
`\S*(?:\s(?!\s*-)\S*)*`:
- `\S*` - 0+ non-whitespace chars
- `(?:\s(?!\s*-)\S*)*` - 0 or more repetitions of
  - `\s` - a whitespace
  - `(?!\s*-)` - not followed by 0+ whitespaces and `-`
  - `\S*` - 0+ non-whitespace chars
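The lazy and the unrolled variants can be compared side by side with a short sketch. Python's `re` module stands in for the C# engine here, which is an assumption: step counts and timings will differ between engines. The sample strings are the question's options without quotes and commas, and `LAZY`, `UNROLLED` and `matches` are hypothetical names. Timing the real engine is arguably more informative than a step counter, so the sketch prints wall-clock times, which vary by machine.

```python
import re
import timeit

# The question's sample options, assumed to be processed one string at a time.
SAMPLE_LINES = [
    "Option 01 - Random phrase - Top Menu",
    "Option 02 - Another Random Phrase - Su Menu",
    "Option 03 - More 01 Phrase - Menu",
    "Option 04 - More Phrase -",
    "Option 05 - Simple Phrase",
]

# Lazy dot: the lookahead is tried again after every single character.
LAZY = re.compile(r"\d+\s+-.*?(?=\s+-|$)")
# Unrolled loop: whole non-whitespace chunks are consumed at once, and the
# hyphen check runs only when a whitespace character is reached.
UNROLLED = re.compile(r"\d+\s+-(?:\s+\S*(?:\s(?!\s*-)\S*)*|.+)")


def matches(pattern):
    """Return the first match of `pattern` in each sample line (None if absent)."""
    results = []
    for line in SAMPLE_LINES:
        found = pattern.search(line)
        results.append(found.group(0) if found else None)
    return results


lazy_results = matches(LAZY)
unrolled_results = matches(UNROLLED)
print(lazy_results == unrolled_results)  # same substrings on these lines

# Timings depend on the machine and the engine, so no figures are quoted.
for name, pattern in (("lazy", LAZY), ("unrolled", UNROLLED)):
    seconds = timeit.timeit(lambda: matches(pattern), number=2000)
    print(f"{name}: {seconds:.4f} s for 2000 passes over the sample")
```

The two patterns agree on the sample lines. They are only "almost" equivalent in general, so any other input format should be checked the same way before swapping one for the other.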

Great illustration, Congratulations. — sYsTeM, Mar 21 '19 at 22:49

score 2 · Answer 2 · answered Mar 21 '19 at 23:04

You can also try `\d+\s+-[^-]*` to get what you want. This has the lowest number of steps so far. Or you could use `\d+\s+-[^-]*(?=\s)` in case you need to cut the match just before the `-`. See the demo.
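A small sketch of how the two negated-class patterns behave, again using Python's `re` module as a stand-in for the C# engine (an assumption). The sample strings are the question's options without quotes and commas, and `NEGATED_CLASS` and `WITH_LOOKAHEAD` are hypothetical names.

```python
import re

# The question's sample options, assumed to be processed one string at a time.
SAMPLE_LINES = [
    "Option 01 - Random phrase - Top Menu",
    "Option 02 - Another Random Phrase - Su Menu",
    "Option 03 - More 01 Phrase - Menu",
    "Option 04 - More Phrase -",
    "Option 05 - Simple Phrase",
]

# Everything after the first dash up to (not including) the next dash.
NEGATED_CLASS = re.compile(r"\d+\s+-[^-]*")
# Same, but the match must be followed by a whitespace character.
WITH_LOOKAHEAD = re.compile(r"\d+\s+-[^-]*(?=\s)")

# Every sample line contains a match, so .group(0) is safe here.
negated_results = [NEGATED_CLASS.search(line).group(0) for line in SAMPLE_LINES]
lookahead_results = [WITH_LOOKAHEAD.search(line).group(0) for line in SAMPLE_LINES]

for plain, trimmed in zip(negated_results, lookahead_results):
    print(f"<{plain}> <{trimmed}>")

# The first pattern keeps the space before the second dash:
#   "01 - Random phrase "
# The lookahead variant drops that space, but when no second dash follows it
# backtracks to the last whitespace, so the last line becomes "05 - Simple".
```

Under these assumptions the lookahead variant shortens an option that has no second dash, so it seems safest when every input is known to contain one.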

Onyambu

Well, it really depends on what work this regex should do - i.e. which substrings it should capture. If just some kind of match is needed, I'd propose `.*`, which requires just 30 steps. But that would be nonsense. – SBS Mar 22 '19 at 07:17

SBS · Answer 3 · 2019-03-22T10:49:17.490

As others have pointed out in the comments, it's not clear what your Regex is supposed to do, because you don't seem to want to capture anything from a potential match. But anyway, I'd recommend the following Regex, which parses an option string into its basic components:

^[^\d]*\d+\s+-\s+.*?(?:\s+-\s+.*?)?$

From this starting point, you can add parentheses around the parts you want to capture. For example:

^[^\d]*(\d+)\s+-\s+(.*?)(?:\s+-\s+(.*?))?$

This would capture the option number and the texts between the dashes. The third capture will be empty for options 04 and 05.
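The capture groups can be inspected with a short sketch. Python's `re` module is used as an assumption in place of the C# engine. The sample strings are the question's options without quotes and commas, and `COMPONENTS` and `parsed` are hypothetical names.

```python
import re

# The question's sample options, assumed to be processed one string at a time.
SAMPLE_LINES = [
    "Option 01 - Random phrase - Top Menu",
    "Option 02 - Another Random Phrase - Su Menu",
    "Option 03 - More 01 Phrase - Menu",
    "Option 04 - More Phrase -",
    "Option 05 - Simple Phrase",
]

# Group 1: option number, group 2: first text, group 3: optional second text.
COMPONENTS = re.compile(r"^[^\d]*(\d+)\s+-\s+(.*?)(?:\s+-\s+(.*?))?$")

# Every sample line matches, so .groups() can be called directly.
parsed = [COMPONENTS.match(line).groups() for line in SAMPLE_LINES]

for number, first_text, second_text in parsed:
    print(number, repr(first_text), repr(second_text))

# In Python a group that did not take part in the match is None, so the
# third group is None for options 04 and 05.
# For option 04 the optional part needs whitespace after the second dash,
# so the trailing " -" stays in group 2: "More Phrase -".
```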

EDIT: Now that the author of the question has clarified which substrings should be captured, I guess this simple and straightforward regex is appropriate:

\d+[^-]*-[^-"]*

It captures the option number, searches for the first dash, then captures everything up to the next dash or quote:

<01 - Random phrase >
<02 - Another Random Phrase >
<03 - More 01 Phrase >
<04 - More Phrase >
<05 - Simple Phrase>

Note that the angle brackets are just added here to show the trailing spaces. Is this what you wanted?
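The listing above can be reproduced with a minimal sketch that runs the pattern over the question's text as one multi-line string. Python's `re` module is an assumption standing in for the C# engine, and `TEXT`, `FINAL` and `found` are hypothetical names.

```python
import re

# The sample text from the question, kept with its quotes and commas.
TEXT = '''"Option 01 - Random phrase - Top Menu",
"Option 02 - Another Random Phrase - Su Menu",
"Option 03 - More 01 Phrase - Menu",
"Option 04 - More Phrase -",
"Option 05 - Simple Phrase"'''

# Digits, anything up to the first dash, then anything up to the next dash
# or double quote.
FINAL = re.compile(r'\d+[^-]*-[^-"]*')

# The pattern has no capture groups, so findall returns the full matches.
found = FINAL.findall(TEXT)

for item in found:
    print(f"<{item}>")  # angle brackets make the trailing spaces visible
```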

@sYsTeM Yes, so just try it on https://regex101.com/r/vzz8Dw/1 - my regex requires 365 steps, compared to 463 required by yours. — SBS, Mar 22 '19 at 05:38
@sYsTeM See my edit. Capturing is now done according to your recent specification, and your test on https://regex101.com/r/vzz8Dw/1 counts 103 steps. — SBS, Mar 22 '19 at 10:53
