regex split on comma but not inside ANY parentheses with recursion in mind

Question

I have an example string:

string myString = "value,value,(value,(value, value, value, (value), value),value)";

The goal is to iterate through it and deserialize it into a hierarchy of class objects.

Most answers to similar questions on here will not work because of the recursion: looking ahead (or back) for an even number of parentheses breaks down once the groups are nested.

I have considered storing it as JSON, but the object types of the values will vary without notice, and that has proved to confuse even Json.NET in the past, especially since the types will likely all be related by inheritance.

So, given the example string, the goal is to split on the comma ",", but ignore everything in parentheses until my recursion loop digs into that subset and then uses the same regex to split its contents.

I have no code yet as I am still brainstorming this method.

Also note that the sub-lists may not necessarily be the last element in the parent list, as demonstrated by a couple of lingering values at the end of my example.

Please do not mark as duplicate without fully reading the question and understanding why it is NOT the same as questions like this

That question is for Java. It certainly is not suitable here. Have you tried anything in .NET? Did you try using [balanced constructs](http://www.regular-expressions.info/balancing.html)? — Wiktor Stribiżew, Feb 01 '16 at 13:24
@WiktorStribiżew regex is pretty standard across languages (usually). A solution using regex for java will likely work with minor if any modifications in c#. — Wobbles, Feb 01 '16 at 13:26
See [*RegEx standards across languages*](http://stackoverflow.com/questions/12739633/regex-standards-across-languages). And Java regex is much poorer than .NET regex (although, there is some interesting difference in favor of Java regarding some Unicode string handling behavior, IMHO). — Wiktor Stribiżew, Feb 01 '16 at 13:29
@Wobbles This is one of these rare cases where a regex-based solution for C# would *not* work in Java and in most other languages. — Sergey Kalinichenko, Feb 01 '16 at 13:31
right.. and? as I said, modifications may be needed (in syntax), but the logic will likely be the same. But its all moot anyway as I pointed out that regardless of language that example will NOT work. — Wobbles, Feb 01 '16 at 13:32
@Wobbles: No, it's false, and it isn't a simple question of syntax but more a question of available features in the regex flavour you choose. — Casimir et Hippolyte, Feb 01 '16 at 13:36
@MattTimmermans, I am aware I could logic my way through it, something to the sort of split of first (, and last ), store the subset then de-serialize remainder then pass subset back into this method creating a recursion, but there has to be a way to cut some of the fat off that using pure regex. — Wobbles, Feb 01 '16 at 13:45
I agree with @MattTimmermans, and you should also keep in mind that a pure regular expression can't handle balanced brackets. — Ilia Maskov, Feb 01 '16 at 14:16
I think I'm just going to alter things by creating a dummy class that has enumerable children just like my current one, but with a property that holds the string equivalent of the type. I would serialize and deserialize that class object to and from JSON, then iterate over it to populate my real class object. That seems the most rock-solid way to do it, since JSON is great at deserializing recursively into known types. — Wobbles, Feb 01 '16 at 16:06
Don't use RegEx. It can't solve this. Use a simple Token Scanner and Parser. Very simple — NineBerry, Apr 01 '23 at 20:27

score 2 · Answer 1 · edited May 23 '17 at 12:15

Although .NET regular expressions have a feature (balancing groups) that lets you match recursively parenthesized groups (see this Q&A for an example), it is much easier to define such a regex for the positive case (i.e. "match a word or an entire parenthesized group") than for the negative case needed for the split (i.e. "match the comma unless it is inside a parenthesized group").
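To illustrate the positive case, here is a minimal sketch that matches each element instead of splitting on the separator. The class name `PositiveCaseTokenizer`, the method `Tokenize`, and the group name `open` are made up for this example. It assumes a top-level element is either plain text without parentheses or a single balanced group, it skips empty elements, and it does not reject unbalanced input.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PositiveCaseTokenizer
{
    // Either a run of characters that are not commas or parentheses,
    // or one balanced parenthesized group. The "open" group acts as a
    // stack: '(' pushes, ')' pops, and (?(open)(?!)) fails the match
    // if anything is still on the stack before the closing ')'.
    private static readonly Regex Token = new Regex(
        @"[^,()]+|\((?>[^()]+|(?<open>\()|(?<-open>\)))*(?(open)(?!))\)");

    // Returns the top-level elements in order of appearance.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(input))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }
}
```

Calling `PositiveCaseTokenizer.Tokenize` on the string from the question should yield the two leading values followed by the whole outer parenthesized group as one element.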

Moreover, in situations when you would like to apply the same regex recursively, there is an advantage to building a simple Recursive Descent Parser.

At the heart of the parser would be the split logic, which counts parentheses while searching for commas and splits only when the parenthesis level is zero:

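// s is the input string to split.
var parts = new List<string>();
var parenLevel = 0;
var lastPos = 0;
for (var i = 0; i != s.Length; i++) {
    switch (s[i]) {
        case '(':
            parenLevel++;
            break;
        case ')':
            parenLevel--;
            if (parenLevel < 0) {
                // A ')' appeared with no matching '(' before it.
                throw new ArgumentException("Unmatched ')' at index " + i);
            }
            break;
        case ',':
            // Only commas outside all parentheses separate elements.
            if (parenLevel == 0) {
                parts.Add(s.Substring(lastPos, i - lastPos));
                lastPos = i + 1;
            }
            break;
    }
}
if (parenLevel != 0) {
    // At least one '(' was never closed.
    throw new ArgumentException("Unmatched '(' in input");
}
// Add the final element; a trailing empty element is skipped.
if (lastPos != s.Length) {
    parts.Add(s.Substring(lastPos, s.Length - lastPos));
}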

Demo.
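As a rough sketch of how the split loop above could drive the recursion, the following wraps it in a method and calls it again on the contents of each parenthesized part. The `Node` type and the `NestedListParser`, `SplitTopLevel`, and `Parse` names are hypothetical, invented for this illustration. Mapping leaves to real object types is left out, since the question does not say how the types are encoded.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree node: a leaf holds Text, a group holds Children.
public sealed class Node
{
    public string Text;           // set for leaves only
    public List<Node> Children;   // set for groups only
    public bool IsLeaf => Children == null;
}

public static class NestedListParser
{
    // Splits s on commas that are outside all parentheses.
    public static List<string> SplitTopLevel(string s)
    {
        var parts = new List<string>();
        int parenLevel = 0, lastPos = 0;
        for (int i = 0; i < s.Length; i++)
        {
            switch (s[i])
            {
                case '(':
                    parenLevel++;
                    break;
                case ')':
                    if (--parenLevel < 0)
                        throw new ArgumentException("Unmatched ')' at index " + i);
                    break;
                case ',':
                    if (parenLevel == 0)
                    {
                        parts.Add(s.Substring(lastPos, i - lastPos));
                        lastPos = i + 1;
                    }
                    break;
            }
        }
        if (parenLevel != 0)
            throw new ArgumentException("Unmatched '(' in input");
        if (lastPos != s.Length)
            parts.Add(s.Substring(lastPos));
        return parts;
    }

    // Builds a group node from s, recursing into each parenthesized part.
    public static Node Parse(string s)
    {
        var group = new Node { Children = new List<Node>() };
        foreach (var raw in SplitTopLevel(s))
        {
            var part = raw.Trim();
            bool wrapped = part.Length >= 2
                && part[0] == '(' && part[part.Length - 1] == ')';
            if (wrapped)
            {
                // Strip the outer parentheses and parse the contents.
                // A part such as "(a)(b)" ends up throwing in SplitTopLevel.
                group.Children.Add(Parse(part.Substring(1, part.Length - 2)));
            }
            else
            {
                group.Children.Add(new Node { Text = part });
            }
        }
        return group;
    }
}
```

Calling `NestedListParser.Parse(myString)` with the string from the question returns a root group whose third child is itself a group, so walking `Children` mirrors the nesting of the input.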

score 1 · Answer 2 · answered Apr 01 '23 at 19:48

Try this pattern:

,(?<!\((?>(?:[^()]|(?<p>\))|(?<-p>\())*))

Note this will only work for C#/.NET.
The regex engines for Java/JavaScript/Python/Perl/etc do not support the balancing groups feature that allows this pattern to handle nested parentheses.
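For reference, a minimal sketch of using the pattern from C# might look like the following. The class and method names (`TopLevelCommaSplitter`, `Split`) are invented for this example; `Regex.Split` is the standard `System.Text.RegularExpressions` call.

```csharp
using System.Text.RegularExpressions;

public static class TopLevelCommaSplitter
{
    // The pattern above as a verbatim string: a comma that is NOT
    // preceded (scanning right-to-left) by an unbalanced '('.
    private static readonly Regex TopLevelComma =
        new Regex(@",(?<!\((?>(?:[^()]|(?<p>\))|(?<-p>\())*))");

    // Splits only on commas that sit outside all parentheses.
    public static string[] Split(string input)
    {
        return TopLevelComma.Split(input);
    }
}
```

To recurse, the outer parentheses of a group element would be stripped and the same method called on what remains.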

Test it out here:
http://regexstorm.net/tester?p=%2c%28%3f%3c!%5c%28%28%3f%3e%28%3f%3a%5b%5e%28%29%5d%7c%28%3f%3cp%3e%5c%29%29%7c%28%3f%3c-p%3e%5c%28%29%29*%29%29&i=value%2cvalue%2c%28value%2c%28value%2c+value%2c+value%2c+%28value%29%2c+value%29%2cvalue%29

And here's an explanation of the pattern (as generated by .NET 7's regex source generator):

/// <remarks>
/// Pattern explanation:<br/>
/// <code>
/// ○ Match ','.<br/>
/// ○ Zero-width negative lookbehind.<br/>
///     ○ Loop greedily and atomically any number of times right-to-left.<br/>
///         ○ Match with 3 alternative expressions.<br/>
///             ○ Match a character in the set [^()] right-to-left.<br/>
///             ○ "p" capture group.<br/>
///                 ○ Match ')' right-to-left.<br/>
///             ○ Non-capturing balancing group. Uncaptures the "p" capture group.<br/>
///                 ○ Match '(' right-to-left.<br/>
///     ○ Match '(' right-to-left.<br/>
/// </code>
/// </remarks>
