Given a string in the following format:

xxx (aaa - bbb - CC-dd - ee-FFF)

I need to write a regex that returns a match if there are more than 3 " - " strings inside the parentheses.

It also needs to split the string (by " - " - space, hyphen, space) and return each of those groups in a separate match. So given the above string, I expect the following matches:

  1. aaa
  2. bbb
  3. CC-dd
  4. ee-FFF

I have the following regex...

\((([\w]).*(.[-].*?){3,}([\w]))\)

but I'm struggling to split the string and return the matches I need.

tomsky
  • Does it have to be regex? Because it sounds like splitting on `" - "` might be easier ... – BurnsBA Oct 04 '18 at 12:37
  • Yes I'd prefer a regex solution. – tomsky Oct 04 '18 at 12:38
  • Why not simply do this in two steps? 1. Capture everything between parentheses, 2. Split on ` - `? – Corion Oct 04 '18 at 12:38
  • It will be a very ugly regex, something like `\((?<o>(?:(?! - )[^()])+)(?: - (?<o>(?:(?! - )[^()])+)){2,}\)` ([demo](http://regexstorm.net/tester?p=%5c%28%28%3f%3co%3e%28%3f%3a%28%3f!+-+%29%5b%5e%28%29%5d%29%2b%29%28%3f%3a+-+%28%3f%3co%3e%28%3f%3a%28%3f!+-+%29%5b%5e%28%29%5d%29%2b%29%29%7b2%2c%7d%5c%29&i=xxx+%28aaa+-+bbb+CC-+dd+-+ee-FFF%29)). But it will do both validation and extraction (get all the captures of the "o" group). – Wiktor Stribiżew Oct 04 '18 at 12:38
  • @maccettura Correct, that's why I expect CC-dd not to be split and be part of one group – tomsky Oct 04 '18 at 12:39

2 Answers

You may use a regex based on a tempered greedy token:

\((?<o>(?:(?! - )[^()])+)(?: - (?<o>(?:(?! - )[^()])+)){3,}\)

See the regex demo

Details

  • \( - a ( char
  • (?<o>(?:(?! - )[^()])+) - Group "o": one or more chars other than ( and ), none of which starts a " - " (space, hyphen, space) sequence
  • (?: - (?<o>(?:(?! - )[^()])+)){3,} - three or more occurrences of
    • " - " - a space, a hyphen, and a space
    • (?<o>(?:(?! - )[^()])+) - Group "o": one or more chars other than ( and ), none of which starts a " - " (space, hyphen, space) sequence
  • \) - a ) char

Get all the Group "o" captures to extract the values.

C# demo:

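using System;
using System.Linq;
using System.Text.RegularExpressions;

// Only the first parenthesized group contains three " - " delimiters.
var s = "xxx (aaa - bbb CC - dd - ee-FFF) (aaa2 - bbb2 CC2- dd2- ee2-FFF2)";
var pattern = @"\((?<o>(?:(?! - )[^()])+)(?: - (?<o>(?:(?! - )[^()])+)){3,}\)";
var ms = Regex.Matches(s, pattern);
foreach (Match m in ms)
{
    Console.WriteLine($"Matched: {m.Value}");
    // Group "o" is captured repeatedly; its capture stack holds every item.
    var res = m.Groups["o"].Captures.Cast<Capture>().Select(x => x.Value);
    Console.WriteLine(string.Join("; ", res));
}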

Output:

Matched: (aaa - bbb CC - dd - ee-FFF)
aaa; bbb CC; dd; ee-FFF
Wiktor Stribiżew
  • Ok, I tested on the sample text OP had at the beginning, and I understood there must be 3 items, not `space-space` substrings. Well, only the quantifier should be adjusted then, the approach is still valid. – Wiktor Stribiżew Oct 04 '18 at 12:56
  • Your linked answer mentions that the performance for this technique might be sub-par. I believe it could be fair to reiterate this warning here because I've witnessed first-hand that regexps in .NET can take a ridiculous amount of time to execute in certain cases, especially if the input is user-provided. – SirDarius Oct 04 '18 at 13:08
  • @SirDarius Well, we are talking about .NET regex here, and it is much more powerful than others. I just tested with `\((?<o>(?: (?!- )|[^() ])+)(?: - (?<o>(?: (?!- )|[^() ])+)){3,}\)` and it shows just a 0.7% increase in performance. `\((?<o>[^() ]*(?: (?!- )[^() ]*)*)(?: - (?<o>[^() ]*(?: (?!- )[^() ]*)*)){3,}\)` is 30% faster, but it may match empty items in Group "o". – Wiktor Stribiżew Oct 04 '18 at 13:20
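
For user-provided input, one way to bound the cost discussed in these comments is to give the regex a match timeout. The following is only a minimal sketch: the class name `SafeMatcher`, the method `TryExtract`, and the 250 ms limit are illustrative assumptions, not part of the answer. It relies on the `Regex(string, RegexOptions, TimeSpan)` constructor and `RegexMatchTimeoutException`, both available since .NET Framework 4.5.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SafeMatcher
{
    // Same pattern as in the answer above.
    private const string Pattern =
        @"\((?<o>(?:(?! - )[^()])+)(?: - (?<o>(?:(?! - )[^()])+)){3,}\)";

    // The 250 ms limit is an arbitrary, illustrative value (assumption).
    private static readonly Regex Rx =
        new Regex(Pattern, RegexOptions.None, TimeSpan.FromMilliseconds(250));

    // Returns the Group "o" captures of the first match, or null when
    // nothing matches or the match attempt exceeds the timeout.
    public static string[] TryExtract(string input)
    {
        try
        {
            Match m = Rx.Match(input);
            if (!m.Success) return null;
            return m.Groups["o"].Captures.Cast<Capture>()
                    .Select(c => c.Value).ToArray();
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat a timed-out attempt as "no match".
            return null;
        }
    }
}
```

With the question's sample, `SafeMatcher.TryExtract("xxx (aaa - bbb - CC-dd - ee-FFF)")` should yield the four items aaa, bbb, CC-dd and ee-FFF. Whether a timeout should count as "no match" or be reported to the caller depends on the application.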
0

This problem can be rephrased like this:

You need to split the text between parentheses using " - " as a delimiter, and determine if there are 4 or more text fragments.

How I would do this:

  1. Use a regexp to get the text, something like: \(([^\)]+)\)
  2. Split the matched text using String.Split(" - ")
  3. Check that the number of elements in the returned array is > 3

This looks more maintainable than a huge regular expression, and should be equivalent in terms of performance, if not faster.
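
The three steps above can be sketched roughly as follows. The class name `ParenSplitter`, the method `Extract`, and the `minItems` parameter are hypothetical names chosen for illustration. The array overload of `Split` is used because the single-string overload `Split(" - ")` is not available on .NET Framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParenSplitter
{
    // Step 1: capture the text between one pair of parentheses.
    private static readonly Regex Parens = new Regex(@"\(([^\)]+)\)");

    // Returns the " - "-separated items of every parenthesized group
    // that contains at least minItems items (4 means "more than 3").
    public static List<string[]> Extract(string input, int minItems = 4)
    {
        var result = new List<string[]>();
        foreach (Match m in Parens.Matches(input))
        {
            // Step 2: split on the literal delimiter " - ".
            string[] parts = m.Groups[1].Value
                .Split(new[] { " - " }, StringSplitOptions.None);

            // Step 3: keep the group only if it has enough items.
            if (parts.Length >= minItems)
                result.Add(parts);
        }
        return result;
    }
}
```

For the question's sample, `ParenSplitter.Extract("xxx (aaa - bbb - CC-dd - ee-FFF)")` should return a single array holding aaa, bbb, CC-dd and ee-FFF. Note that, unlike the single-regex answer, this sketch does not reject empty items: a group such as "( - a - b - c)" would produce an empty first element.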

SirDarius