I have a regular expression with the following pattern in C#:

Regex param = new Regex(@"^-|^/|=|:");

Basically, it's for command-line parsing.

If I pass the command-line arguments below, it splits on the colon in C: as well.

/Data:SomeData /File:"C:\Somelocation"

How do I make it ignore characters inside double or single quotes?

Alan Moore
Frank Q.

2 Answers

You can do this in two steps:

Use the first regex

Regex args = new Regex("[/-](?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

to split the string into the different arguments. Then use the regex

Regex param = new Regex("[=:](?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

to split each of the arguments into parameter/value pairs.
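As a rough sketch of how the two steps might fit together, the snippet below reuses both patterns from this answer (written as verbatim strings). The class name `QuoteAwareSplit`, the `Parse` method, and the decision to trim whitespace and surrounding double quotes from the value are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; only the two regex patterns come from the answer.
public static class QuoteAwareSplit
{
    // Step 1: split on / or - that is followed by an even number of double quotes.
    static readonly Regex ArgSplitter =
        new Regex(@"[/-](?=(?:[^""]*""[^""]*"")*[^""]*$)");

    // Step 2: split on = or : under the same condition.
    static readonly Regex PairSplitter =
        new Regex(@"[=:](?=(?:[^""]*""[^""]*"")*[^""]*$)");

    public static List<KeyValuePair<string, string>> Parse(string commandLine)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (string raw in ArgSplitter.Split(commandLine))
        {
            string arg = raw.Trim();
            if (arg.Length == 0) continue;   // empty piece before the first / or -

            // Limit to two pieces so only the first separator splits name from value.
            string[] parts = PairSplitter.Split(arg, 2);
            string name = parts[0];
            // Assumption: callers want the value without its surrounding quotes.
            string value = parts.Length > 1 ? parts[1].Trim('"') : "";
            result.Add(new KeyValuePair<string, string>(name, value));
        }
        return result;
    }
}
```

For the input from the question, `/Data:SomeData /File:"C:\Somelocation"`, this should yield the pairs `Data`/`SomeData` and `File`/`C:\Somelocation`. Note that the first pattern splits on every unquoted `/` or `-`, so an unquoted value that contains a hyphen or slash would be cut apart.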

Explanation:

[=:]      # Split on this regex...
(?=       # ...only if the following matches afterwards:
 (?:      # The following group...
  [^"]*"  #  any number of non-quote characters, then one quote
  [^"]*"  #  repeat, to ensure even number of quotes
 )*       # ...repeated any number of times, including zero,
 [^"]*    # followed by any number of non-quotes
 $        # until the end of the string.
)         # End of lookahead.

Basically, the lookahead checks whether an even number of quotes follows the current position. If so, we're outside a quoted string. However, this (somewhat manageable) regex only handles double quotes, and only if there are no escaped quotes inside them.
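To make that limitation concrete, here is a small sketch that counts how many pieces the parameter/value pattern produces for a single argument. The class `LookaheadLimit` and the method `PieceCount` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the pattern is the double-quote-only regex from above.
public static class LookaheadLimit
{
    static readonly Regex PairSplitter =
        new Regex(@"[=:](?=(?:[^""]*""[^""]*"")*[^""]*$)");

    // Returns how many pieces the splitter produces for one argument.
    public static int PieceCount(string arg)
    {
        return PairSplitter.Split(arg).Length;
    }
}
```

With `File:"C:\x"` the first colon has two double quotes after it, so the argument splits into two pieces, while the colon inside the quotes has only one quote after it and is left alone. With `Data:'"'` the single-quoted double quote leaves an odd count after the colon, so nothing is split and the argument comes back as one piece.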

The following regex handles single and double quotes, including escaped quotes, correctly. But I guess you'll agree that if anybody ever finds this in production code, I'm guaranteed a feature article on The Daily WTF:

Regex param = new Regex(
    @"[=:]
    (?=      # Assert even number of (relevant) single quotes, looking ahead:
     (?:
      (?:\\.|""(?:\\.|[^""\\])*""|[^\\'""])*
      '
      (?:\\.|""(?:\\.|[^""'\\])*""|[^\\'])*
      '
     )*
     (?:\\.|""(?:\\.|[^""\\])*""|[^\\'])*
     $
    )
    (?=      # Assert even number of (relevant) double quotes, looking ahead:
     (?:
      (?:\\.|'(?:\\.|[^'\\])*'|[^\\'""])*
      ""
      (?:\\.|'(?:\\.|[^'""\\])*'|[^\\""])*
      ""
     )*
     (?:\\.|'(?:\\.|[^'\\])*'|[^\\""])*
     $
    )", 
    RegexOptions.IgnorePatternWhitespace);

Further explanation of this monster here.

Tim Pietzcker
  • `/Data: '"'` will break your regex. This is a good answer, and fits the question asked, but this isn't really the sort of task that ought to be done with a Regular Expression on a production level. OP should write an actual lexer if he's worried about edge cases. – FrankieTheKneeMan Sep 20 '12 at 20:22
  • @FrankieTheKneeMan: You're right; I failed to address single quotes in this regex. How about the new regex? Hope it's not giving you nightmares :) – Tim Pietzcker Sep 20 '12 at 20:37
  • Holy crap, that's one helluva regex. – FrankieTheKneeMan Sep 20 '12 at 20:38
  • @FrankieTheKneeMan: Yeah, I think I'm coming down with a bad case of leaning toothpick syndrome. – Tim Pietzcker Sep 20 '12 at 20:40
  • It cannot handle App.exe /Input:"C:\path". I am expecting parameter = Input and value = C:\Path – Frank Q. Nov 10 '12 at 20:16
  • @FrankQ: Your original regex only splits on a `/` if it's the first character in the string, so I assumed that that's the intended behaviour. Isn't that so? In that case, just replace the `(?:^[-/]|[=:])` part of the regex with `[-/=:]`; if that's not what you want please explain what exactly you *do* want the regex to do. – Tim Pietzcker Nov 10 '12 at 21:15
  • Hi Tim, let me explain in a little more detail. A parameter can be specified using either / or -. A parameter and its value can be separated using :, =, or a space. E.g. App.exe /Input:"C:\File" -Out=D:\File /Out2 "D:\File2" -Action (Expected result: Param is Input, Value is C:\File; Param is Out, Value is D:\File; Param is Out2, Value is D:\File2; Param is Action, Value is none.) Hope this is clear. – Frank Q. Nov 11 '12 at 00:13
You should read "Mastering Regular Expressions" to understand why there's no general solution to your question. Regexes cannot handle that to an arbitrary depth. As soon as you start to escape the escape character or to escape the escaping of the escape character or ... you're lost. Your use case needs a parser and not a regex.
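As a minimal sketch of what such a parser could look like, the code below tokenizes on unquoted whitespace and then pairs each switch with its value, following the rules the asker spelled out in the comments (switches start with `/` or `-`; name and value are separated by `:`, `=`, or a space). All names here (`ArgLexer`, `Tokenize`, `Parse`, `IsSwitch`) are hypothetical, and the sketch assumes backslash is an ordinary character rather than an escape, so Windows paths pass through unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical hand-written lexer; not taken from either answer.
public static class ArgLexer
{
    // Splits on unquoted whitespace. Quote characters delimit text but are not kept.
    public static List<string> Tokenize(string commandLine)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        char quote = '\0';       // '\0' means "not inside quotes"
        bool inToken = false;

        foreach (char c in commandLine)
        {
            if (quote != '\0')
            {
                if (c == quote) quote = '\0';   // closing quote
                else current.Append(c);
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;
                inToken = true;                 // so "" still yields an empty token
            }
            else if (char.IsWhiteSpace(c))
            {
                if (inToken)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                    inToken = false;
                }
            }
            else
            {
                current.Append(c);
                inToken = true;
            }
        }
        if (inToken) tokens.Add(current.ToString());
        return tokens;
    }

    public static List<KeyValuePair<string, string>> Parse(string commandLine)
    {
        var result = new List<KeyValuePair<string, string>>();
        List<string> tokens = Tokenize(commandLine);
        for (int i = 0; i < tokens.Count; i++)
        {
            if (!IsSwitch(tokens[i])) continue;   // e.g. the program name

            string body = tokens[i].Substring(1);
            int sep = body.IndexOfAny(new[] { ':', '=' });
            string name = sep >= 0 ? body.Substring(0, sep) : body;
            string value = sep >= 0 ? body.Substring(sep + 1) : "";

            // No ':' or '=' found: a following non-switch token is taken as the value.
            if (sep < 0 && i + 1 < tokens.Count && !IsSwitch(tokens[i + 1]))
                value = tokens[++i];

            result.Add(new KeyValuePair<string, string>(name, value));
        }
        return result;
    }

    static bool IsSwitch(string token)
    {
        return token.Length > 1 && (token[0] == '/' || token[0] == '-');
    }
}
```

For `App.exe /Input:"C:\File" -Out=D:\File /Out2 "D:\File2" -Action` this should produce Input/`C:\File`, Out/`D:\File`, Out2/`D:\File2`, and Action with an empty value. One known gap in this sketch: because quotes are stripped during tokenizing, a quoted value that begins with `/` or `-` after a bare switch would be mistaken for another switch.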

Achim