
I am trying to find the right way to get this regex:

\b([A-Za-z]|_)+\b

to not match whole words between quotes (both ' and "). For example:

example_ _where // matches: example_, _where
this would "not be matched" // matches: this and would
while this would // matches: while, this and would
while 'this' would // matches: while and would

Additionally I am trying to find out how to include words containing numbers, but NOT only numbers, so again:

this is number 0 // matches: this, is and number
numb3r f1ve // matches: numb3r and f1ve
example_ _where // matches: example_ and _where
this would "not be 9 matched" // matches: this and would
while 'this' would // matches: while and would

The goal is to try and match only words that would be valid variable names in most common programming languages, without matching anything in a string.

  • Have you tried fiddling with it in sites like: https://regex101.com/ or https://www.debuggex.com/ ? – Noctis Jun 19 '15 at 10:23
  • possible duplicate of [Skip text between quotes in Regex](http://stackoverflow.com/questions/18403739/skip-text-between-quotes-in-regex) – Spider man Jun 19 '15 at 10:27

4 Answers


This should work:

"[a-zA-Z_0-9\s]+"|([a-zA-Z0-9_]+)

The idea is that a word surrounded by " is consumed by the first alternative, which has no capture group, so only unquoted words end up in group 1.
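As a rough illustration of the "consume the quoted string, capture everything else" idea, here is a minimal Python sketch. The question is about .NET, so Python's `re` module is only a stand-in here, and the helper name `unquoted_words` is made up for this example.

```python
import re

# First alternative: consumes a double-quoted run of word characters and
# whitespace WITHOUT capturing it.
# Second alternative: captures a bare word in group 1.
PATTERN = re.compile(r'"[a-zA-Z_0-9\s]+"|([a-zA-Z0-9_]+)')

def unquoted_words(text):
    """Return only the group-1 captures, i.e. words outside double quotes."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

print(unquoted_words('this would "not be matched"'))  # ['this', 'would']
print(unquoted_words('example_ _where'))              # ['example_', '_where']
```

Note that the quoted alternative only allows letters, digits, underscores and whitespace between the quotes, so a string containing other punctuation would not be skipped by this particular pattern.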

demo :)

greenfeet
    Wow, I can't believe there was such a simple answer to this question. To accept single quotes as well, and following @Richard's remark that \w matches alphanumerics plus underscore, I modified it to: "[\w\s]+"|'[\w\s]+'|([\w]+) –  Jun 19 '15 at 11:51

Use the regex below, then read the string you want from capture group 1.

@"""[^""]*""|'[^']*'|\b([a-zA-Z_\d]*[a-zA-Z_][a-zA-Z_\d]*)\b"

DEMO
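To see how group 1 behaves on the examples from the question, the same pattern can be tried in Python. This is only a sketch: the original is a C# verbatim string, Python's `re` is assumed to behave the same way for this pattern, and the function name `identifiers` is hypothetical.

```python
import re

# Quoted strings (either quote style) are consumed without being captured.
# Group 1 captures a word that contains at least one letter or underscore,
# so words made up only of digits are rejected.
PATTERN = re.compile(
    r'''"[^"]*"|'[^']*'|\b([a-zA-Z_\d]*[a-zA-Z_][a-zA-Z_\d]*)\b'''
)

def identifiers(text):
    """Return the group-1 captures, skipping matches that were quoted strings."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

print(identifiers('this is number 0'))               # ['this', 'is', 'number']
print(identifiers('numb3r f1ve'))                    # ['numb3r', 'f1ve']
print(identifiers('this would "not be 9 matched"'))  # ['this', 'would']
print(identifiers("while 'this' would"))             # ['while', 'would']
```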

Avinash Raj

Since .NET allows variable length lookbehind you can use this regex:

\b\w+\b(?!(?<="[^"\n]*)[^"\n]*")

This matches a word unless a double quote precedes it and another double quote follows it on the same line.

anubhava

To simplify things, note that \w matches alphanumerics plus underscore.

So \b\w+\b will match one or more such characters with word boundaries.

Avoiding the quotes cannot be done simply with [^'"]\b\w+…, because that fails if there is no character preceding the target string (e.g. at the beginning of the input). A negative lookbehind does not have that problem, and a negative lookahead handles the quote after:

(?<!['`"])\b\w+\b(?!['`"])

(Because those are negative groups, do not negate the character classes.)

To avoid matching words made up only of digits, a lookahead can be used again:

(?<!['`"])\b(?!\d+\b)\w+\b(?!['`"])

Explanation:

  • The character before is not ['"]
  • A word boundary
  • Following characters to the next word boundary are not all digits.
  • "Word" characters (this will be the match)
  • Word boundary
  • Not followed by a quote.

Test, in PowerShell:

PS> [regex]::Matches("foo 'asas' bar 123456 ba55z 1xyzzy", "(?<!['`"])\b(?!\d+\b)\w+\b(?!['`"])")

Groups   : {foo}
Success  : True
Captures : {foo}
Index    : 0
Length   : 3
Value    : foo

Groups   : {bar}
Success  : True
Captures : {bar}
Index    : 11
Length   : 3
Value    : bar

Groups   : {ba55z}
Success  : True
Captures : {ba55z}
Index    : 22
Length   : 5
Value    : ba55z

Groups   : {1xyzzy}
Success  : True
Captures : {1xyzzy}
Index    : 28
Length   : 6
Value    : 1xyzzy
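For a cross-check outside PowerShell, the same lookaround pattern can be run in Python. This is a sketch under a few assumptions: the backtick is left out because it appears to be PowerShell's escape for the double quote rather than part of the regex, Python's `re` accepts the pattern because the lookbehind is fixed-width, and the function name `words` is made up for the example.

```python
import re

# (?<!['"])  the character before the word is not a quote
# (?!\d+\b)  the word is not made up only of digits
# (?!['"])   the character after the word is not a quote
PATTERN = re.compile(r"""(?<!['"])\b(?!\d+\b)\w+\b(?!['"])""")

def words(text):
    """Return every whole match of the lookaround pattern."""
    return PATTERN.findall(text)

print(words("foo 'asas' bar 123456 ba55z 1xyzzy"))
# ['foo', 'bar', 'ba55z', '1xyzzy']

print(words('this would "not be matched"'))
# ['this', 'would', 'be']
```

The second call shows a limit of this approach: the lookarounds only inspect the characters directly next to each word, so a word in the middle of a multi-word quoted string (here "be") still matches.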
Richard