3

When constructing a regular expression for matching a list of candidate strings, how to ensure all the strings can be matched? For example,

This regular expression (?:O|SO|S|OH|OS)(?:\s?[-+*°.1-9]){0,4} can match all the examples below

O 4 2 -
O 2 - 
SO 4 * - 2 
S 2- 

However, if I swap S and SO, the resulting regular expression (?:O|S|SO|OH|OS)(?:\s?[-+*°.1-9]){0,4} failed to match the SO 4 * - 2 as a whole, instead it is separated into two matches: S and O 4 * - 2.

So my confusion is how to order the list of candidate strings in the regular expression, so that all of them can be safely and uniquely matched? Since the actual list of candidate strings in my project is a bit more complicated than the example, is there a sorting algorithm that can achieve this?

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
meTchaikovsky
  • 7,478
  • 2
  • 15
  • 34
  • @41686d6564standsw.Palestine No, it does not work for me, the SO4 is not matched, see https://regex101.com/r/zbBHWG/1 – meTchaikovsky Apr 25 '22 at 02:10
  • Ah, my bad. Well, the problem is that everything in your pattern after the first non-capturing group is optional. So, `S` alone is considered a match. Then, because another match can start with `O` and the remaining text satisfies the rest of the pattern, it's considered a new match. You just have to order the elements in your non-capturing group in a way that doesn't allow "sub-matches" to be caught first. The easiest way to achieve that is to list them in a descending order by length. – 41686d6564 stands w. Palestine Apr 25 '22 at 02:17
  • Different RE implementations handle this differently. Some regex engines try to get the largest match for each component. Some engines try to make the whole expression match as much as possible. – Tim Roberts Apr 25 '22 at 02:17
  • Note that you don't _have_ to order them by length; that's just the easiest way that should work regardless of what elements you have. If you want to be more specific about it, just make sure the former ones cannot override latter ones. For example, don't put `S` before `SO` and don't put `O` before `OS` or `OH`. – 41686d6564 stands w. Palestine Apr 25 '22 at 02:22
  • The interesting part here is not only the reordering of the alternatives, it is the word boundary that prevents a partial match and changing the quantifier to `{1,4}` to match at least one of the characters in the character class. – The fourth bird Apr 25 '22 at 13:03

3 Answers3

1

You could add \b word boundary assertions to ensure that O and S match a whole word.

\b(?:O|S|SO|OH|OS)\b(?:\s?[-+*°.1-9]){0,4}

Demo: https://regex101.com/r/xmrZP0/1

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
  • This is great and all (in that it fixes this specific problem), but it does not address the underlying one. – 41686d6564 stands w. Palestine Apr 25 '22 at 02:24
  • I think it does address the underlying problem. It's more robust than carefully ordering the `|` options. If there's a deficiency can you post an example that it handles incorrectly so I can improve it? – John Kugelman Apr 25 '22 at 02:26
  • Sure, suppose the pattern was something like `(?:O|S|SO|OH|OS).??`. A word boundary wouldn't help here. You'd have to order them correctly if you want to ensure that "SO" has a higher priority than "S", for example. Note that I'm only bringing this up because the OP has mentioned that their real-life scenario is more complicated. Otherwise, this would be a good solution that works for this specific case. – 41686d6564 stands w. Palestine Apr 25 '22 at 02:35
1

The regular expression engine tries to match the alternatives in the order in which they are specified.

So when the pattern is (S|SO)? it matches S immediately and continues trying to find matches. The next bit of the input string is O4*-2 which cannot be matched.

So, I think the trick here to match all given string.

(?:O|S)(?:O|H|S)*(?:\s?[-+*°.1-9]){0,4}

Demo: https://regex101.com/r/3AwQP7/1

GAVD
  • 1,977
  • 3
  • 22
  • 40
1

You could repeat the character class 1 or more times to prevent matching only single uppercase characters from the alternation and reorder the alternatives:

\b(?:SO|OS|O[HS]|[SO])(?:\s?[-+*°.1-9]){1,4}

The pattern matches:

  • \b A word boundary to prevent a partial word match
  • (?: Non capture group for the alternatives
    • SO|OS|O[HS]|[SO] Match either SO OS OH OS S O
  • ) Close the non capture group
  • (?:\s?[-+*°.1-9]){1,4} Repeat 1-4 times an optional whitespace char and 1 of the listed characters

See a regex101 demo.

The fourth bird
  • 154,723
  • 16
  • 55
  • 70