6

This is hard to word correctly, so here is the short version.

Given a fixed sentence (let's say "THE TREE IS GREEN"), I want to match occurrences of it in a text where any of the spaces between its words is doubled (or more).

Example:

"In this text,
THE TREE IS GREEN should not match,
THE  TREE IS GREEN should
and so should THE  TREE   IS GREEN
but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern."

My initial approach would be

/THE( {2,})TREE( {2,})IS( {2,})GREEN/

but this only matches if every space in the sequence is doubled. I'd like any single one of the groups to be enough to trigger a full match. Am I going the wrong way, or is there a way to make this work?

4 Answers

4

You can use a negative lookahead, if your regex flavor supports it.

First rule out the sentence that you want to fail, in your case the single-spaced "THE TREE IS GREEN", then match the most general form, which allows one or more spaces between the words.

(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)

https://regex101.com/r/EYDU6g/2
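As a quick check, the pattern above can be run against the sample text from the question. This is only a minimal sketch in Python; the variable names `PATTERN`, `text` and `matches` are illustrative and not part of the answer.

```python
import re

# Pattern from the answer: the negative lookahead rejects the single-spaced
# sentence, then the group accepts one or more spaces between the words.
PATTERN = re.compile(r"(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)")

text = (
    "In this text,\n"
    "THE TREE IS GREEN should not match,\n"
    "THE  TREE IS GREEN should\n"
    "and so should THE  TREE   IS GREEN\n"
    "but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern."
)

# Collect the full text of every match.
matches = [m.group(0) for m in PATTERN.finditer(text)]
print(matches)  # ['THE  TREE IS GREEN', 'THE  TREE   IS GREEN']
```

The single-spaced sentence is skipped because the lookahead fails at that position, and the double-spaced words on the last line are ignored because they are not part of the sentence.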

karthick
2

You can just search for the spaces that you're looking for:

/ {2,}/ will work to match two or more of the space character. (https://regexr.com/4h4d4)

You can capture the results by surrounding it with parentheses - /( {2,})/

You may want to broaden it a bit.
/\s{2,}/ will match any doubling of whitespace. (\s - means any whitespace - space, tab, newline, etc.)

No need to match the whole string, just the piece that's of interest.
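To make the difference between the two patterns concrete, here is a small Python sketch. The sample string is made up for illustration and does not come from the question.

```python
import re

# Made-up sample: a double space, a double tab, and a space followed by a newline.
sample = "a  b\t\tc \nd"

# / {2,}/ only sees runs of the literal space character.
spaces_only = re.findall(r" {2,}", sample)

# /\s{2,}/ sees runs of any whitespace, including tabs and newlines.
any_whitespace = re.findall(r"\s{2,}", sample)

print(spaces_only)     # ['  ']
print(any_whitespace)  # ['  ', '\t\t', ' \n']
```

Note that both patterns report doubled whitespace anywhere in the text, not only inside a particular sentence.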

Laizer
  • Thank you for the example, but I want to quickly scan text for double spaces occurring only within a specific sequence of words, like "the tree is green". I don't mind other text being double-spaced, so I am looking for a one-off solution using only regex. – Alessandro Jeanteur Jul 08 '19 at 20:20
0

If I am not mistaken, you want to match the whole line if it contains a part with 2 or more spaces between 2 uppercase parts.

If that is the case, you might use:

^.*[A-Z]+ {2,}[A-Z]+.*$
  • ^ Start of string
  • .*[A-Z]+ Match any char except a newline 0+ times, then match 1+ times an uppercase char
  • [ ]{2,} Match a space 2 or more times (square brackets used for clarity)
  • [A-Z]+ Match 1+ times an uppercase char
  • .*$ Match any char except a newline 0+ times until the end of the string
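The breakdown above can be tried out with a short Python sketch. It assumes the `re.MULTILINE` flag so that `^` and `$` anchor at every line of the sample text from the question; the names `LINE_PATTERN`, `text` and `flagged` are illustrative only.

```python
import re

# Pattern from the answer; MULTILINE makes ^ and $ match at each line boundary.
LINE_PATTERN = re.compile(r"^.*[A-Z]+ {2,}[A-Z]+.*$", re.MULTILINE)

text = (
    "In this text,\n"
    "THE TREE IS GREEN should not match,\n"
    "THE  TREE IS GREEN should\n"
    "and so should THE  TREE   IS GREEN\n"
    "but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern."
)

# Each match is a whole line containing a double space between uppercase words.
flagged = LINE_PATTERN.findall(text)
for line in flagged:
    print(repr(line))
```

With this sample, three lines are flagged, including the last one, because the pattern accepts any uppercase words and not only the words of the target sentence.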


The fourth bird
  • This works in my example because I used uppercase for clarity, but it does not work for a specific word sequence. I edited my example: your regex incorrectly flags double-spaced uppercase words outside the base pattern. – Alessandro Jeanteur Jul 08 '19 at 20:19
  • So you mean that the pattern is always `THE TREE IS GREEN` and the whole sentence should match if there is at least a single match for a double space between the words of the pattern? – The fourth bird Jul 08 '19 at 20:24
  • I mean that for a given sentence like 'THE TREE IS GREEN' I'd like a pattern that will match the sentence itself (matching the whole line can work but is not necessary) if it contains any double space between those words. As @3limin4t0r noted, something like `/THE {2,}TREE +IS +GREEN|THE +TREE {2,}IS +GREEN|THE +TREE +IS {2,}GREEN/gm` works, but it is already quite inelegant and doesn't scale well to larger sentences – Alessandro Jeanteur Jul 08 '19 at 20:38
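One way around the scaling concern is to generate the pattern from the sentence instead of writing it by hand. The sketch below is an assumption on my part, not something proposed in this thread as such: `build_pattern` is a hypothetical helper that combines the negative-lookahead idea from another answer with a programmatically built "one or more spaces" pattern.

```python
import re

def build_pattern(sentence: str) -> re.Pattern:
    """Hypothetical helper: match `sentence` with any spacing except single spaces throughout."""
    words = [re.escape(word) for word in sentence.split()]
    exact = " ".join(words)   # the correctly single-spaced form
    loose = " +".join(words)  # one or more spaces between the words
    # Reject the exact form first, then accept the loosely spaced form.
    return re.compile(rf"(?!{exact}){loose}")

pattern = build_pattern("THE TREE IS GREEN")

text = (
    "THE TREE IS GREEN should not match,\n"
    "THE  TREE IS GREEN should\n"
    "and so should THE  TREE   IS GREEN\n"
    "but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern."
)

found = pattern.findall(text)
print(found)  # ['THE  TREE IS GREEN', 'THE  TREE   IS GREEN']
```

The generated pattern grows linearly with the number of words, whereas the hand-written alternation needs one full copy of the sentence per gap.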
0

You could match the sentence with any amount of spacing, then filter out the correctly spaced occurrences in code:

import re

# Allow one or more spaces between the words of the target sentence.
pattern = r"THE +TREE +IS +GREEN"

test_str = ("In this text,\n"
    "THE TREE IS GREEN should not match,\n"
    "THE  TREE IS GREEN should\n"
    "and so should THE TREE   IS GREEN\n"
    "but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern.")

# Report every occurrence except the correctly single-spaced one.
for match in re.finditer(pattern, test_str):
    if match.group() != 'THE TREE IS GREEN':
        print(match.group())
SanV
  • After I posted this, I noticed that it is along the same lines as @karthick's answer above – SanV Jul 08 '19 at 21:12