I think that rather than matching words outside the quote marks, you could match words inside the quote marks and replace them with ''.
To that end, I suggest you have a look at this question and @RicardoNolde's answer:
(?>(?(STR)(?(ESC).(?<-ESC>)|\\(?<ESC>))|(?!))|(?(STR)"(?<-STR>)|"(?<STR>))|(?(STR).|(?!)))+
(See his answer for a much better explanation than I could give, as I'm not familiar with the .NET engine.)
This matches all quoted text. If you remove it (i.e. replace it with '') and then just match the resulting string with `@"\b(\w+)\b"`, you should be fine.
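As a rough illustration of the strip-then-match idea, here is a small Python sketch. It is not .NET and it does not use the balancing-group regex above: `QUOTED` is a simplified stand-in I am assuming for this example (a quoted run where a backslash escapes the next character), and `words_outside_quotes` is a made-up helper name.

```python
import re

# Simplified stand-in for the balancing-group pattern (an assumption, not the
# linked answer's regex): a double-quoted run in which a backslash escapes
# the character that follows it.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')
WORD = re.compile(r'\b(\w+)\b')

def words_outside_quotes(text):
    """Delete quoted runs, then return the words that remain."""
    stripped = QUOTED.sub('', text)
    return WORD.findall(stripped)

sample = r'Hello world "this is a quote \"containing another quote\" end quote" goodbye'
print(words_outside_quotes(sample))  # ['Hello', 'world', 'goodbye']
```

Note that `\w+` also picks up digits and underscores, so tighten the word pattern if you only want letters.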
However, you will have problems unless, in your string:
- all quote pairs are well-formed (i.e. there is an even number of quotes in the entire text)
- all quote pairs match (i.e. no `\"` paired with a plain `"`, like in your example)
- any nested quotes are escaped (`"This is a quote that contains another "quote", tricky!"` arguably contains `"This is a quote that contains another "` and `", tricky!"` within quotes).
(The previous regex appears to work on your example for the `\"this still shouldn't be matched"`, but if you change it to `"this still shouldn't be matched\" but this should. "hi"`, you will have problems, as the internal `\"` is regarded as an escaped quote and not as part of a balanced pair).
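If you want a quick sanity check before running any regex, something like the following Python sketch could flag inputs that break the first two rules. The function names are made up for this example, and it assumes a backslash always escapes the character after it. It cannot detect unescaped nested quotes, because those still add up to an even count.

```python
def count_unescaped_quotes(text):
    """Count double quotes that are not escaped by a preceding backslash."""
    count = 0
    i = 0
    while i < len(text):
        if text[i] == '\\':
            i += 2  # skip the backslash and the character it escapes
            continue
        if text[i] == '"':
            count += 1
        i += 1
    return count

def quotes_look_balanced(text):
    """True when the number of unescaped quotes is even."""
    return count_unescaped_quotes(text) % 2 == 0

print(quotes_look_balanced(r'say "hi" now'))   # True
print(quotes_look_balanced(r'say \"hi" now'))  # False: \" opens, plain " closes
# Unescaped nesting slips through, since it still has four bare quotes:
print(quotes_look_balanced('"This is a quote that contains another "quote", tricky!"'))  # True
```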
That being said, if your text satisfies those three rules I mentioned above, you can do what you want with ordinary regex (although I feel that since you're using .NET you may as well take advantage of its stack feature):
(?<!")\b[a-zA-Z]+\b(?=(?>((\\"|[^"])*)"(?>(\\"|[^"])*)")*(\\"|[^"])*$)
This means "match any words followed by an even number of unescaped quote marks."
The logic is that since quote marks are paired, if you are not within a set of quote marks, there are an even number of (unescaped) quote marks remaining.
See it in action here. (The `(?>...)` atomic groups stop the regex engine from backtracking unnecessarily, which improves performance.)
(NOTE: I changed your unmatched quote marks `\"this still shouldn't be matched"` to `"this still shouldn't be matched"` so that the input obeys the three rules above).
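Here is a hedged Python sketch of the same "even number of unescaped quotes ahead" idea. It is a variant, not a literal copy of the regex above: I swapped `(\\"|[^"])` for `(?:\\.|[^"\\])` so that each character can only be consumed one way, which lets me leave out the atomic groups. The names `ESC` and `OUTSIDE_WORD` and the sample string are made up for the example.

```python
import re

# ESC is one unit of text that is not a bare quote: either a backslash plus
# the character it escapes, or any character other than a quote or backslash.
ESC = r'(?:\\.|[^"\\])'

OUTSIDE_WORD = re.compile(
    r'(?<!")\b[a-zA-Z]+\b'  # a word that does not directly follow a quote
    # lookahead: zero or more complete quote pairs, then no further bare quote
    + '(?=(?:' + ESC + '*"' + ESC + '*")*' + ESC + '*$)'
)

sample = r'one "two three" four "five \"six\" seven" eight'
print(OUTSIDE_WORD.findall(sample))  # ['one', 'four', 'eight']
```

Every candidate word triggers a scan to the end of the string, so on long inputs stripping the quoted text first is likely to be cheaper.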
Also note that you can't say "match any words followed by an even number of quote marks" (including escaped ones), as then you'll have problems with words inside nested quote marks matching. For example, `Hello world "this is a quote \"containing another quote\" end quote" goodbye` will erroneously have the internal `another quote` match the regex, as there are an even number of quote marks remaining in the string.
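To make that failure concrete, here is a small Python sketch of the naive version that counts every quote mark, escaped or not. `NAIVE` is a name I made up, and the pattern is my own simplification for illustration, not something from the question.

```python
import re

# Naive variant: [^"] happily consumes backslashes, so an escaped quote \"
# is counted exactly like a bare quote.
NAIVE = re.compile(r'(?<!")\b[a-zA-Z]+\b(?=(?:[^"]*"[^"]*")*[^"]*$)')

sample = r'Hello world "this is a quote \"containing another quote\" end quote" goodbye'
print(NAIVE.findall(sample))
# ['Hello', 'world', 'another', 'quote', 'goodbye']
```

The words `another` and `quote` sit inside the outer quotes, but two quote marks remain after each of them, so the naive lookahead lets them through.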
In summary
You really need all quote pairs to be well-formed/matched and nested quotes to be escaped in order for any sort of regex to work, .NET engine or not.
I recommend using @RicardoNolde's answer from the other question (linked above) to remove all quoted text, and then match all remaining words.