3

I need to split text (sql query) by each comma which is not between parenthesis.

Example (I marked commas which should be included in split):

a."Id",                //<- this comma
a."Description",       //<- this comma
UJsonObject(
   fepv."Id",          //<- NOT this comma
   fepv."SystemName",  //<- NOT this comma
   string_agg(
        translations."Translations", ',' //<- NOT this comma (here can be some nested parenthesis also)
   ) as "Translations"
) as "Translations",   //<- this comma
b."DataSource",        //<- this comma
a."Name",              //<- this comma
a."Value"

I found universal solution here: https://regex101.com/r/6lQKjP/2 but it appears that this solution is not working in dotnet.

I would like to use Regex.Split, but if this case can be satisfied by Regex.Matches I will be happy too. Also I know I can write my own parser, but I read that simple cases (which not extract nested parenthesis) can be handled via Regex.

Adam Mrozek
  • 1,410
  • 4
  • 26
  • 49
  • 1
    You cannot "convert" the PCRE regex to .NET because .NET does not support skipping after faailing a match. You need to completely re-write this, use another approach. E.g., `var s = Regex.Replace(text, @"\((?>[^()]+|(?)\(|(?<-o>)\))*(?(o)(?!))\)|(,)", m=>m.Groups[1].Success ? "___temp___" : m.Value)` and then `var results = s.Split("___temp___");` – Wiktor Stribiżew Apr 09 '20 at 16:29
  • Why is https://www.mathworks.com/matlabcentral/answers/336123-skip-commas-in-brackets-regexp a universal solution? It looks like some workaround thatdoes not even do what is asked for. `\w+(\([^)]+\))?` simply matches 1+ word chars and an optional `(...)` substring that has no `)` inside. – Wiktor Stribiżew Apr 09 '20 at 16:45
  • Maybe but I tested it and works for me. Universal - I mean works with any sql expression and it not depends on syntax I use. In each case it selects correct commas. – Adam Mrozek Apr 09 '20 at 16:48
  • So, if it works for you - why ask a question? `\w+(\([^)]+\))?` is a valid .NET regex. – Wiktor Stribiżew Apr 09 '20 at 16:50
  • @WiktorStribiżew because I need this to work in dotnet I as wrote in question. – Adam Mrozek Apr 09 '20 at 16:50
  • It [works in .NET](http://regexstorm.net/tester?p=%5cw%2b%28%5c%28%5b%5e%29%5d%2b%5c%29%29%3f&i=a.%22Id%22%2c%0d%0aa.%22Description%22%2c%0d%0aUJsonObject%28%0d%0a+++fepv.%22Id%22%2c%0d%0a+++fepv.%22SystemName%22%2c%0d%0a+++string_agg%28%0d%0a++++++++translations.%22Translations%22%2c+%27%2c%27+%2f%2f%3c-+NOT+this+comma+%28here+can+be+some+nested+parenthesis+also%29%0d%0a+++%29+as+%22Translations%22%0d%0a%29+as+%22Translations%22%2c%0d%0ab.%22DataSource%22%2c%0d%0aa.%22Name%22%2c%0d%0aa.%22Value%22), but I can't see how it can work for you. – Wiktor Stribiżew Apr 09 '20 at 16:50
  • The `UJsonObject(....` matches up to the first `)`, not to the corresponding closing `)`. – Wiktor Stribiżew Apr 09 '20 at 16:54
  • There is some bug in my question. Worked example should navigate to regex101 not to mathworks. Sorry for misunderstanding. I updated my question. – Adam Mrozek Apr 09 '20 at 17:02

3 Answers3

5

This PCRE regex - (\((?:[^()]++|(?1))*\))(*SKIP)(*F)|, - uses recursion, .NET does not support it, but there is a way to do the same thing using balancing construct. The From the PCRE verbs - (*SKIP) and (*FAIL) - only (*FAIL) can be written as (?!) (it causes an unconditional fail at the place where it stands), .NET does not support skipping a match at a specific position and resuming search from that failed position.

I suggest replacing all commas that are not inside nested parentheses with some temporary value, and then splitting the string with that value:

var s = Regex.Replace(text, @"\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|(,)", m =>
     m.Groups[1].Success ? "___temp___" : m.Value);
var results = s.Split("___temp___");

Details

  • \((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\) - a pattern that matches nested parentheses:
    • \( - a ( char
    • (?>[^()]+|(?<o>)\(|(?<-o>)\))* - 0 or more occurrences of
      • [^()]+| - 1+ chars other than ( and ) or
      • (?<o>)\(| - a ( and a value is pushed on to the Group "o" stack
      • (?<-o>)\) - a ) and a value is popped from the Group "o" stack
    • (?(o)(?!)) - a conditional construct that fails the match if Group "o" stack is not empty
    • \) - a ) char
  • | - or
  • (,) - Group 1: a comma

Only the comma captured in Group 1 is replaced with a temp substring since the m.Groups[1].Success check is performed in the match evaluator part.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
1

You can match your tokens in a single pass using a .NET regular expression Balancing Groups:

(?>
    (?<S>\()      # if you see an open parentheses, push it to the stack
    |
    (?<-S>\))     # match a closing parentheses when the stack has a paired open parentheses
    |
    [^,()]        # match any character except parentheses or commas
    |
    (?(S),|(?!))  # if we're already inside parentheses, we're allowed to match a comma
)+
(?(S)(?!))    # at the end, make sure there are no extra open parentheses we didn't close.

You can get the tokens as:

var matches = Regex.Matches(input, pattern, RegexOptions.IgnorePatternWhitespace)
                   .Select(m => m.Value).ToList();

Working example in Sharp Labs

This approach is a bit complicated, but the syntax it supports can be expanded without too much trouble. For example, we can add support for -- single line SQL comments comments and 'SQL strings':

(?>
    (?<S>\()
    |
    (?<-S>\))
    |
    --.*                 # match from "--" to the end of the line
    |
    '[^']*(?:''[^']*)*'  # match SQL string, single quote, escaped by two single quotes
    |
    [^,()]
    |
    (?(S),|(?!))
)+
(?(S)(?!))

Working example

Kobi
  • 135,331
  • 41
  • 252
  • 292
0

Please regard this as an extended comment. In pseudo-code, the commas not enclosed in parentheses can be identified as follows:

commas = []
n = 0
for each index i of string
   c = char at index i of string
   if c == '('
     increase n by 1
   elsif c == ')'
     decrease n by 1 if n > 0, else raise unbalanced parens exception
   elsif c == ','
     add i to commas if n equals 0
   end
end
raise unbalanced parens exception if n > 0

The array comma will contain the indices of the commas on which the string is to be split. Splitting the string at given indices is straightforward.

The variable n equals the number of left parentheses that are not yet matched by a right parentheses. The code also confirms that the parentheses are balanced.

Cary Swoveland
  • 106,649
  • 6
  • 63
  • 100
  • This was my first solution. If none of regular expression was working then I will still using this algorithm. – Adam Mrozek Apr 09 '20 at 19:53
  • Considering that this approach is straightforward and undoubtedly faster than the use of a regular expression, why would you prefer the latter? – Cary Swoveland Apr 09 '20 at 19:59