
I'm looking to find every word not within double quotes using a .NET regular expression. Here's some sample text:

Hello world I want to get all of these words as a match "but not these ones...
because they're inside a string. And maybe I'll \"escape\" the quotes too." Also,
these words should match. Now we're outside of the string. And I can't escape
quotes; \"this still shouldn't be matched."

So I'd want to match:

Hello, world, I, want, to, get, all, of, these, words, as, a, match, Also,
these, words, should, match, Now, we, re, outside, of, the, string, And, I,
can, t, escape, quotes

Is this possible using the .NET regex external stack and assertions? I've gotten this far:

(?<=(?(rstack)|(?!))(?<-rstack>").*?(?<rstack>").*?)\w+... same thing for fstack

'Course, it doesn't work.

Alan Moore
Ry-
  • Without the balancing grouping constructs there appears to be a trivial solution to get the required output (removing matches of `\".*?(?:$|(?<!\\)\")` and then finding matches of `\w+` in the resulting string). Since there is no balancing in the input (quotes anywhere appear to be legal, there's no nesting, and opening/closing delimiters are identical), under what circumstances should the assertion fail? – drf Jan 23 '12 at 03:21
  • @drf: I was hoping never (if they're unbalanced, ignore them) but just about anything is okay. – Ry- Jan 23 '12 at 15:01

2 Answers


I think that rather than matching words outside the quote marks, you could match words inside the quote marks and replace them with ''.

To that end, I suggest you have a look at this question and @RicardoNolde's answer:

(?>(?(STR)(?(ESC).(?<-ESC>)|\\(?<ESC>))|(?!))|(?(STR)"(?<-STR>)|"(?<STR>))|(?(STR).|(?!)))+

(See his question for a much better explanation than I could give, as I'm not familiar with the .NET engine).

This matches all text inside quotes. If you remove it (i.e. replace it with '') and then match the resulting string against @"\b(\w+)\b", you'll get the words you want.
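The remove-then-match idea can be sketched as follows. This is only an illustration under assumptions: the quoted-span pattern is a simpler stand-in (not @RicardoNolde's balancing-group expression), it assumes well-formed quote pairs with backslash escapes, and the sample text and variable names are hypothetical. Quoted spans are replaced with a space rather than '' so that words on either side of a quote cannot fuse together.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample with an escaped quote inside the quoted span.
string sampleText = @"Hello world ""but not \""these\"" ones"" and these";

// Stand-in pattern: a quote, then any run of escaped characters or
// characters that are neither a quote nor a backslash, then a closing quote.
string quotedSpan = @"""(?:\\.|[^""\\])*""";

// Step 1: blank out every quoted span.
string withoutQuotes = Regex.Replace(sampleText, quotedSpan, " ", RegexOptions.Singleline);

// Step 2: collect the words that remain.
string[] remainingWords = Regex.Matches(withoutQuotes, @"\b\w+\b")
    .Cast<Match>()
    .Select(match => match.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", remainingWords));
```

For this sample the remaining words should be Hello, world, and, these. Note that \w also accepts digits and underscores, so swap in [a-zA-Z]+ if only letters should count as words.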

However, you will have problems unless, in your string:

  • all quote pairs are well-formed (i.e. there is an even number of quotes in the entire text)
  • all quote pairs match (i.e. no \" with a corresponding " like in your example)
  • any nested quotes are escaped ("This is a quote that contains another "quote", tricky!" arguably contains "This is a quote that contains another " and ", tricky!" within quotes).

(The previous regex appears to work on your example for the \"this still shouldn't be matched", but if you change it to "this still shouldn't be matched\" but this should. "hi", you will have problems, as the internal \" is regarded as an escaped quote and not as part of a balanced pair).
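A quick pre-check for the first rule might look like the sketch below. It is an assumption-laden illustration, not part of the linked answer: the helper name is hypothetical, and a quote is treated as escaped only when a backslash sits directly before it, so an escaped backslash followed by a quote is not handled.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the number of unescaped quotes is even.
// Assumption: a quote counts as escaped only if a backslash precedes it.
Func<string, bool> hasEvenUnescapedQuotes = text =>
    Regex.Matches(text, @"(?<!\\)""").Count % 2 == 0;

// Two unescaped quotes (the escaped pair around "bye" is ignored).
Console.WriteLine(hasEvenUnescapedQuotes(@"say ""hi"" and \""bye\"""));

// One unescaped quote, so the text breaks the first rule.
Console.WriteLine(hasEvenUnescapedQuotes(@"say ""hi"));
```

An even count is necessary but not sufficient: it says nothing about the second and third rules.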


That being said, if your text satisfies those three rules I mentioned above, you can do what you want with ordinary regex (although I feel that since you're using .NET you may as well take advantage of its stack feature):

(?<!")\b[a-zA-Z]+\b(?=(?>(?>\\"|[^"])*"(?>\\"|[^"])*")*(?>\\"|[^"])*$)

This means "match any words followed by an even number of unescaped quote marks." The logic is that since quote marks are paired, if you are not within a set of quote marks, there are an even number of (unescaped) quote marks remaining.

See it in action here (The (?>...) are to avoid the regex engine doing unnecessary back-tracking so that the performance is better). (NOTE: I changed your unmatched quote marks \"this still shouldn't be matched" to "this still shouldn't be matched" so that the input obeys the three rules above).

Also note that you can't say "match any words followed by an even number of quote marks" (including escaped ones), as then words inside nested quote marks will match. For example, in Hello world "this is a quote \"containing another quote\" end quote" goodbye, the inner words another quote would erroneously match the regex, as there is an even number of quote marks remaining in the string after them.
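The nested-quote example can be run as a small sketch. In the pattern below each \\"|[^"] alternative is wrapped in an atomic group so that, on backtracking, a backslash cannot be re-read as an ordinary character and turn an escaped quote into an unescaped one. The variable names are hypothetical, and the input is assumed to obey the three rules above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// "A word followed by an even number of unescaped quotes", with every
// escaped-quote-or-other-character alternative made atomic.
string evenQuotesPattern =
    @"(?<!"")\b[a-zA-Z]+\b(?=(?>(?>\\""|[^""])*""(?>\\""|[^""])*"")*(?>\\""|[^""])*$)";

// Escaped quotes nested inside a quoted span.
string nestedInput =
    @"Hello world ""this is a quote \""containing another quote\"" end quote"" goodbye";

string[] outsideWords = Regex.Matches(nestedInput, evenQuotesPattern)
    .Cast<Match>()
    .Select(match => match.Value)
    .ToArray();

Console.WriteLine(string.Join(", ", outsideWords));
```

Only Hello, world and goodbye should come back here: every word inside the quoted span is followed by exactly one unescaped quote, so the lookahead fails for it.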

In summary

You really need all quote pairs to be well-formed/matched and nested quotes to be escaped in order for any sort of regex to work, .NET engine or not.

I recommend using @RicardoNolde's answer from the other question (linked above) to remove all quoted text, and then match all remaining words.

mathematical.coffee
  • I would prefer not to do the replacing method and I can make the string conform to those rules and do some post-processing, so I'll use that. Thanks! – Ry- Jan 23 '12 at 15:11

This expression uses balancing groups to return the required words. After matching the expression, the words outside quotes can be accessed as m.Groups["word"].Captures.OfType<Capture>().Select(c => c.Value). By including an optional assertion in the pattern, the match can be made to fail if quotes are unbalanced; if the assertion is removed from the expression, extraneous quotes are ignored.

The following is a driver that includes the pattern and prints the desired output.

using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = @"Hello world I want to get all of these words as a match ""but not these ones...  because they're inside a string. And maybe I'll \""escape\"" the quotes too."" Also,  these words should match. Now we're outside of the string. And I can't escape  quotes; \""this still shouldn't be matched.""";
string pattern = @"(?>
                     ^(?:
                       #capture word only if not inside a quotation
                       (?(quote)\w+|(?<word>\w+))
                         (?:
                           ([^\w""]*|$)
                             (?(quote)
                                  #if within a quote, close unless escaped
                                  (?:(?<=\\)\""|(?<-quote>(?<!\\)\""))
                                  |
                                  #if not within a quote, open quote
                                  (?<quote>\"")
                             )?
                         )*
                       )*
                     )$
                     (?(quote)(?!)) # will fail to match if extra quotes
                                    # if line removed, will ignore extra quote";

RegexOptions options = RegexOptions.IgnorePatternWhitespace;

Match m = Regex.Match(input, pattern, options);
if (!m.Success)
{
    Console.WriteLine("Failed");
}
else
{
    // Each capture of the "word" group is one word found outside quotes.
    foreach (var word in m.Groups["word"].Captures.OfType<Capture>().Select(a => a.Value))
    {
        Console.WriteLine(word);
    }
}
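
The push/pop mechanics that the driver relies on can be seen in isolation in the sketch below. It is a stripped-down illustration, not the pattern above: the group name q and the variable names are hypothetical, and escaped quotes are deliberately ignored to keep the stack behaviour visible.

```csharp
using System;
using System.Text.RegularExpressions;

// (?<q>")    pushes a capture onto the stack "q" (an opening quote)
// (?<-q>")   pops that capture again (the closing quote)
// (?(q)(?!)) fails the match if anything is still on the stack at the end
string balancedQuotes = @"^(?:[^""]|(?(q)(?<-q>"")|(?<q>"")))*(?(q)(?!))$";

bool closedOk = Regex.IsMatch(@"a ""b"" c", balancedQuotes);  // every quote is closed
bool leftOpen = Regex.IsMatch(@"a ""b", balancedQuotes);      // last quote never closes

Console.WriteLine(closedOk);
Console.WriteLine(leftOpen);
```

The first input should match and the second should not, which is the same role the trailing (?(quote)(?!)) assertion plays in the driver.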
drf