19

How can I search for keywords that are not inside a string?

For example if I have the text:

Hello this text is an example.

bla bla bla "this text is inside a string"

"random string" more text bla bla bla "foo"

I would like to match every occurrence of the word text that is not inside " ". In other words, I would like to match:

[image: the example text above with the matches highlighted]

Note that I do not want to match the text highlighted in red, because it is inside a string.


Possible solution:

I have been working on it, and this is what I have so far:

(?s)((?<q>")|text)(?(q).*?"|)

Note that a regex conditional has the form: (?(predicate) true alternative|false alternative)

so the regex reads:

Find " or text. If you found ", keep selecting until you find " again (.*?"); if you found text, do nothing more.

When I run that regex I match the whole string, though. I am asking this question for learning purposes; I know I could remove all the strings first and then look for what I need.

Greg Hewgill
Tono Nam
  • Have you tried an online regex generator such as: http://txt2re.com/index-csharp.php3 – Surfbutler Jul 23 '12 at 20:53
  • 2
    Why would you want to match a string whose content you already know? What do you plan to do with the result? Intent is important for others to be able to give an appropriate answer. – Gaute Løken Jul 23 '12 at 20:55
  • You don't need to know the intent of the question in order to answer it. Also, you are assuming that he knows what the string is. He only gives examples to demonstrate what he is trying to do, and those are not necessarily what he will end up using. He's looking for a specific result, and how that result is to be used is none of our business. – Richard Robertson Sep 20 '17 at 16:09

4 Answers

26

Here is one answer:

(?<=^([^"]|"[^"]*")*)text

This means:

(?<=       # preceded by...
^          # the start of the string, then
([^"]      # either not a quote character
|"[^"]*"   # or a full string
)*         # as many times as you want
)
text       # then the text

You can easily extend this to handle strings containing escapes as well.

In C# code:

Regex.Match("bla bla bla \"this text is inside a string\"",
            "(?<=^([^\"]|\"[^\"]*\")*)text", RegexOptions.ExplicitCapture);

Added from comment discussion - extended version (match on a per-line basis and handle escapes). Use RegexOptions.Multiline for this:

(?<=^([^"\r\n]|"([^"\\\r\n]|\\.)*")*)text

In a C# string this looks like:

"(?<=^([^\"\r\n]|\"([^\"\\\\\r\n]|\\\\.)*\")*)text"

Since you now want to use ** instead of " here is a version for that:

(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text

Explanation:

(?<=       # preceded by
^          # start of line
 (         # either
 [^*\r\n]| #  not a star or line break
 \*(?!\*)| #  or a single star (star not followed by another star)
  \*\*     #  or 2 stars, followed by...
   ([^*\\\r\n] # either: not a star or a backslash or a linebreak
   |\\.        # or an escaped char
   |\*(?!\*)   # or a single star
   )*          # as many times as you want
  \*\*     # ended with 2 stars
 )*        # as many times as you want
)
text      # then the text

Since this version doesn't contain " characters, it's cleaner to use a C# verbatim string:

@"(?<=^([^*\r\n]|\*(?!\*)|\*\*([^*\\\r\n]|\\.|\*(?!\*))*\*\*)*)text"
porges
  • Porges, thanks for the help! If I were to have `" \r\n text \r\n " bla bla...`, that does not produce a match... I guess the reason is that `[^"]` will continue onto the next line... – Tono Nam Jul 23 '12 at 21:18
  • 1
    @TonoNam: If you want it to match on a per-line basis then change both `[^"]` to `[^"\r\n]` and add `RegexOptions.Multiline` to the options. – porges Jul 23 '12 at 21:24
  • ```"(?<=^([^\"]|\"[^\"]*\")*)text"``` doesn't work if there is any text after the quoted text. – tponthieux Nov 07 '14 at 21:21
  • Great solution, but it doesn't work when the quoted string spans multiple lines and the matched word is on a new line: https://regex101.com/r/UhAi9f/1 – Liphtier Feb 23 '22 at 12:27
8

This can get pretty tricky, but here is one potential method that works by making sure that there is an even number of quotation marks between the matching text and the end of the string:

text(?=[^"]*(?:"[^"]*"[^"]*)*$)

Replace text with the regex that you want to match.

Rubular: http://www.rubular.com/r/cut5SeWxyK

Explanation:

text            # match the literal characters 'text'
(?=             # start lookahead
   [^"]*          # match any number of non-quote characters
   (?:            # start non-capturing group, repeated zero or more times
      "[^"]*"       # one quoted portion of text
      [^"]*         # any number of non-quote characters
   )*             # end non-capturing group
   $              # match end of the string
)               # end lookahead
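As a quick check of the lookahead above, here is a minimal sketch in Python, whose `re` module accepts the pattern unchanged. The helper name `find_unquoted` and the sample string are made up for illustration, and the approach assumes that quotes are balanced and never escaped.

```python
import re

# Hypothetical sample, modelled on the text in the question.
SAMPLE = (
    'Hello this text is an example.\n'
    'bla bla bla "this text is inside a string"\n'
    '"random string" more text bla bla bla "foo"'
)

def find_unquoted(word, s):
    """Return start offsets of `word` followed by an even number of quotes."""
    # Same lookahead as above: only whole "..." pairs may remain before the end.
    pattern = re.escape(word) + r'(?=[^"]*(?:"[^"]*"[^"]*)*$)'
    return [m.start() for m in re.finditer(pattern, s)]

offsets = find_unquoted("text", SAMPLE)
print(offsets)
```

On this sample the two unquoted occurrences are reported and the quoted one is skipped. A stray or escaped quote anywhere after a candidate changes the parity of the count, so unbalanced input can flip the result.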
Andrew Clark
1

I would simply match the quoted occurrences of text greedily, using a non-capturing group to filter them out, and then use a capturing group for the unquoted occurrence, like this:

".*(?:text).*"|(text)

which you might want to refine a little for word boundaries etc. But this should get you where you want to go, and serve as a clear, readable sample.
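A minimal Python sketch of this "match what you don't want, capture what you do" idea follows. The helper `unquoted_offsets` and the sample are hypothetical. It also tries a variant, `"[^"]*"|(text)`, which is my own assumption about one possible refinement: because `.*` is greedy, the pattern above can run from the first quote on a line to the last one and swallow an unquoted text sitting between two strings.

```python
import re

# Hypothetical sample, modelled on the text in the question.
SAMPLE = (
    'Hello this text is an example.\n'
    'bla bla bla "this text is inside a string"\n'
    '"random string" more text bla bla bla "foo"'
)

def unquoted_offsets(pattern, s):
    """Return start offsets of matches in which capture group 1 took part."""
    return [m.start(1) for m in re.finditer(pattern, s)
            if m.group(1) is not None]

# Pattern from the answer: greedy .* between the surrounding quotes.
greedy = unquoted_offsets(r'".*(?:text).*"|(text)', SAMPLE)

# Assumed refinement: consume one complete "..." string at a time.
per_string = unquoted_offsets(r'"[^"]*"|(text)', SAMPLE)

print(greedy, per_string)
```

On this sample the greedy pattern reports only the occurrence on the first line, while the per-string variant also reports the one between "random string" and "foo". Either way, the caller has to check whether group 1 participated in each match.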

Gaute Løken
0

I have used these answers many times and want to share an alternative approach, since I was sometimes unable to implement and use the given answers.

Instead of matching keywords that are outside of something, split the task into two subtasks:

  1. replace everything you do not need to match with an empty string
  2. use an ordinary match

For example, to replace the text in quotes I use:

[dbo].[fn_Utils_RegexReplace] ([TSQLRepresentation_WHERE], '''.*?(?<!\\)''', '')

or, more clearly: '.*?(?<!\\)'.

I know that this may look like double work and may have a performance impact on some platforms/languages, so everyone needs to test this, too.
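The two steps can be sketched in Python as below. This is only an illustration under stated assumptions: it targets the double-quoted strings from the question rather than the single-quoted SQL literals above, the names `QUOTED` and `count_unquoted` are invented here, and backslash is assumed to be the escape character.

```python
import re

# Hypothetical sample, modelled on the text in the question.
SAMPLE = (
    'Hello this text is an example.\n'
    'bla bla bla "this text is inside a string"\n'
    '"random string" more text bla bla bla "foo"'
)

# Step 1 pattern: one double-quoted string, allowing backslash escapes inside.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

def count_unquoted(word, s):
    """Count whole-word occurrences of `word` outside double-quoted strings."""
    stripped = QUOTED.sub('', s)  # step 1: remove what must not match
    # Step 2: ordinary whole-word match on what is left.
    return len(re.findall(r'\b' + re.escape(word) + r'\b', stripped))

print(count_unquoted("text", SAMPLE))
```

Note that offsets found in the stripped text no longer line up with the original input, so this fits counting or existence checks better than locating matches.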

gotqn