
I have this regex which detects all words:

\b[^\d\W]+\b

And I have this regex to detect quoted texts:

\'[^\".]*?\'|\"[^\'.]*?\"

Is there a regex that detects only the words that are not inside quotes (either single or double)?

example:

import re
a = "big mouse eats cheese? \"non-detected string\" 'non-detected string too' hello guys"
re.findall(some_regex, a)

It should output ['big', 'mouse', 'eats', 'cheese', 'hello', 'guys'].

I know I can use re.sub() to find the quoted text and replace it with an empty string, but that's what I don't want to do.

I also looked at the question "regex match keywords that are not in quotes" and tried (^([^"]|"[^"]*")*)|(^([^']|'[^']*')*), but it didn't work. From "A regex to detect string not enclosed in double quotes" I also tried (?<![\S"])([^"\s]+)(?![\S"])|(?<![\S'])([^'\s]+)(?![\S']). Both detected all words.

Ibrahim

1 Answer

You can use

import re
a = '''big mouse eats cheese? "non-detected string" 'non-detected string too' hello guys'''
print([x for x in re.findall(r'''"[^"]*"|'[^']*'|\b([^\d\W]+)\b''', a) if x])
# => ['big', 'mouse', 'eats', 'cheese', 'hello', 'guys']

See the Python demo. The list comprehension is used to post-process the output to remove empty items that result from matching the quoted substrings.

This approach works because, when the pattern contains a capturing group, re.findall returns only the captured substrings rather than the whole matches. The "[^"]*"|'[^']*' part matches strings between double or single quotes but does not capture them, while the \b([^\d\W]+)\b part matches and captures into Group 1 one or more letters or underscores between word boundaries.
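To make the role of the list comprehension concrete, here is a small sketch that prints what re.findall returns before filtering. The variable names `pattern`, `raw` and `words` are only illustrative. Each quoted substring is matched by one of the first two alternatives, so Group 1 does not participate and re.findall reports an empty string for that match.

```python
import re

a = '''big mouse eats cheese? "non-detected string" 'non-detected string too' hello guys'''
pattern = r'''"[^"]*"|'[^']*'|\b([^\d\W]+)\b'''

# With exactly one capturing group, findall returns the Group 1 text per match.
# Matches that came from a quoted alternative contribute an empty string.
raw = re.findall(pattern, a)
print(raw)
# ['big', 'mouse', 'eats', 'cheese', '', '', 'hello', 'guys']

# Dropping the empty strings leaves only the words outside quotes.
words = [x for x in raw if x]
print(words)
# ['big', 'mouse', 'eats', 'cheese', 'hello', 'guys']
```

The two empty strings correspond to the double-quoted and the single-quoted substring, in that order.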

Wiktor Stribiżew
  • So if I try `re.compile().search()`, will it work? As you said, it works because re.findall() behaves weirdly. – Ibrahim Aug 15 '21 at 15:32
  • @Good To get the first match only, you can use `re.search`, but you need to check that there is a match first and then read the `match.group(1)` value. However, with this pattern the first match may be a quoted string, in which case Group 1 is empty. So get all matches with `re.findall` as suggested, then use indexing to get the first non-empty match if necessary. – Wiktor Stribiżew Aug 15 '21 at 15:32
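As a possible alternative to indexing into the re.findall result, the first unquoted word could also be found lazily with re.finditer. This is only a sketch: the helper name `first_unquoted_word` and the constant `PATTERN` are hypothetical and not part of the answer above. It relies on `match.group(1)` being `None` whenever the match came from one of the quoted alternatives.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'''"[^"]*"|'[^']*'|\b([^\d\W]+)\b''')

def first_unquoted_word(text):
    """Return the first word outside quotes, or None if there is none."""
    for m in PATTERN.finditer(text):
        # Group 1 is None when the match was a quoted substring.
        if m.group(1) is not None:
            return m.group(1)
    return None

sample = '''"quoted first" then plain'''

# re.search stops at the quoted substring, so Group 1 is not set.
print(PATTERN.search(sample).group(1))  # None

# Iterating over the matches skips the quoted substring.
print(first_unquoted_word(sample))  # then
```

This shows why a bare `re.search` is not enough here: the leftmost match may be a quoted string, so the matches have to be scanned until one of them sets Group 1.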