How to use a regex to match if any pattern appears once out of many times in a given sequence

Question

Hard to word this correctly, but TL;DR.

I want to match a given sentence (let's say "THE TREE IS GREEN") within a text only if any of the spaces inside it is doubled (or more).

Example:

"In this text,
THE TREE IS GREEN should not match,
THE  TREE IS GREEN should
and so should THE  TREE   IS GREEN
but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern."

My initial approach would be

/THE( {2,})TREE( {2,})IS( {2,})GREEN/

but this only matches if every space in the sequence is doubled. I'd like any one of the groups to trigger a full match instead. Am I going the wrong way, or is there a way to make this work?
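To make the limitation concrete, here is a minimal sketch that runs the initial pattern against three variants of the sentence. Python is used purely for illustration (this simple pattern should behave the same way in JavaScript), and the sample list and variable names are assumptions made for the demo.

```python
import re

# The asker's initial pattern: every gap must be two or more spaces.
initial = re.compile(r"THE( {2,})TREE( {2,})IS( {2,})GREEN")

samples = [
    "THE TREE IS GREEN",     # all gaps single-spaced
    "THE  TREE IS GREEN",    # only the first gap doubled
    "THE  TREE  IS  GREEN",  # every gap doubled
]

# True where the pattern finds a match in the sample.
results = [bool(initial.search(s)) for s in samples]
print(results)  # [False, False, True]
```

Only the last sample matches, which is why a single doubled gap goes undetected.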

I'll be using the Node 10 engine for this, but I'm curious to see other variants on other engines. — Alessandro Jeanteur, Jul 08 '19 at 21:12

karthick · Accepted Answer · 2019-07-09T16:39:01.237

4

You can use a negative lookahead, if your engine supports it.

First rule out the sentence that should fail (in your case, "THE TREE IS GREEN"), then give the most generic pattern that catches your desired result.

(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)

https://regex101.com/r/EYDU6g/2
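As a quick check, the following sketch applies the lookahead pattern to the sample text from the question. It is written in Python only for illustration; the variable names are assumptions, and the pattern itself is copied from the answer unchanged.

```python
import re

# Reject the single-spaced sentence first, then accept any spacing.
pattern = re.compile(r"(?!THE TREE IS GREEN)(THE[ ]+TREE[ ]+IS[ ]+GREEN)")

text = (
    "In this text,\n"
    "THE TREE IS GREEN should not match,\n"
    "THE  TREE IS GREEN should\n"
    "and so should THE  TREE   IS GREEN\n"
    "but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern."
)

# Collect every occurrence of the sentence that has at least one doubled gap.
flagged = [m.group(1) for m in pattern.finditer(text)]
print(flagged)
```

The single-spaced sentence and the double-spaced text outside the sentence are both skipped; only the two variants with a widened gap are reported.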

edited Jul 09 '19 at 16:39

answered Jul 08 '19 at 20:52

karthick


yeah was about to change it to ( +) – karthick Jul 08 '19 at 21:03
Negative lookahead seems like the way to go! Clean and elegant, accepted. – Alessandro Jeanteur Jul 08 '19 at 21:10
FYI, the solution and the regex101 link don't match – Sano Jul 09 '19 at 02:05

Laizer · Answer 2 · 2019-07-08T19:40:11.550

2

You can just search for the spaces that you're looking for:

/ {2,}/ will work to match two or more of the space character. (https://regexr.com/4h4d4)

You can capture the results by surrounding it with parentheses - /( {2,})/

You may want to broaden it a bit.
/\s{2,}/ will match any doubling of whitespace. (\s - means any whitespace - space, tab, newline, etc.)

No need to match the whole string, just the piece that's of interest.
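A small sketch of the difference between the two patterns, in Python for illustration (the sample strings and variable names are assumptions). It also shows that this approach flags doubled spaces anywhere in the text, not only inside a particular sentence.

```python
import re

double_space = re.compile(r" {2,}")        # two or more literal spaces
double_whitespace = re.compile(r"\s{2,}")  # two or more whitespace characters of any kind

in_phrase = "THE  TREE IS GREEN"      # doubled space inside the sentence
tabs_only = "THE TREE\t\tIS GREEN"    # doubled tab, single spaces
elsewhere = "but  double-spaced  TEXT"  # doubled spaces in unrelated text

space_hits = [bool(double_space.search(s)) for s in (in_phrase, tabs_only, elsewhere)]
whitespace_hits = [bool(double_whitespace.search(s)) for s in (in_phrase, tabs_only, elsewhere)]

print(space_hits)       # [True, False, True]
print(whitespace_hits)  # [True, True, True]
```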

edited Jul 08 '19 at 19:40

answered Jul 08 '19 at 19:34

Laizer


Thank you for the example, but I want to quickly scan text for this pattern occurring only within a specific sequence of words, like "the tree is green". I don't mind other text being double-spaced, so I am looking for a one-off solution using only regex. – Alessandro Jeanteur Jul 08 '19 at 20:20

score 0 · Answer 3 · answered Jul 08 '19 at 19:53

0

If I am not mistaken, you want the whole line to match if it contains 2 or more spaces between 2 uppercase parts.

If that is the case, you might use:

^.*[A-Z]+[ ]{2,}[A-Z]+.*$

^ Start of string
.*[A-Z]+ Match any char except a newline 0+ times, then match 1+ times [A-Z]
[ ]{2,} Match a space 2 or more times (square brackets used for clarity)
[A-Z]+ Match 1+ times an uppercase char
.*$ Match any char except a newline 0+ times until the end of the string
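The sketch below tries this pattern line by line on the question's sample text, in Python for illustration (variable names are assumptions; a bare space and [ ] are equivalent here). It shows that the pattern matches any line with a doubled space between uppercase runs, whichever words are involved.

```python
import re

# Whole-line match when 2+ spaces sit between two uppercase runs.
pattern = re.compile(r"^.*[A-Z]+ {2,}[A-Z]+.*$")

text = (
    "In this text,\n"
    "THE TREE IS GREEN should not match,\n"
    "THE  TREE IS GREEN should\n"
    "and so should THE  TREE   IS GREEN\n"
    "but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern."
)

# Test each line separately so ^ and $ apply per line.
matched = [line for line in text.splitlines() if pattern.match(line)]
for line in matched:
    print(line)
```

Three lines are printed, including the last one, where the doubled spaces fall between unrelated uppercase words.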


answered Jul 08 '19 at 19:53

The fourth bird


This works in my example because I used uppercase for clarity, but it does not work for a specific word sequence. I edited my example: your regex incorrectly flags double-spaced uppercase words outside the base pattern. – Alessandro Jeanteur Jul 08 '19 at 20:19
So you mean that the pattern is always `THE TREE IS GREEN` and the whole sentence should match if there is at least a single match for a double space between the words of the pattern? – The fourth bird Jul 08 '19 at 20:24
I mean that for a given sentence like 'THE TREE IS GREEN', I'd like a pattern that will match the sentence itself (matching the whole line can work but is not necessary) if it contains any double space between those words. As @3limin4t0r pointed out, something like `/THE {2,}TREE +IS +GREEN|THE +TREE {2,}IS +GREEN|THE +TREE +IS {2,}GREEN/gm` works, but it is already quite inelegant and doesn't scale well to larger sentences – Alessandro Jeanteur Jul 08 '19 at 20:38
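One way around the scaling concern is to generate the pattern from the sentence instead of writing the alternatives by hand. The sketch below, in Python for illustration, builds a negative-lookahead pattern of the kind used in the accepted answer; the helper name `doubled_space_pattern` and the sample strings are hypothetical, not something from the thread.

```python
import re

def doubled_space_pattern(phrase):
    """Build a regex matching `phrase` only when some gap has 2+ spaces.

    Hypothetical helper: it rejects the single-spaced phrase with a
    negative lookahead, then accepts one or more spaces in every gap.
    """
    words = [re.escape(word) for word in phrase.split()]
    single_spaced = " ".join(words)  # the form that must NOT match
    flexible = " +".join(words)      # any number of spaces per gap
    return re.compile("(?!" + single_spaced + ")" + flexible)

pat = doubled_space_pattern("THE TREE IS GREEN")
clean = pat.search("THE TREE IS GREEN")
dirty = pat.search("so THE TREE  IS GREEN here")

# The same helper handles a longer sentence without extra work.
longer = doubled_space_pattern("THE QUICK BROWN FOX JUMPS")
long_hit = longer.search("x THE QUICK BROWN  FOX JUMPS y")

print(clean, dirty.group(), long_hit.group())
```

The generated pattern grows linearly with the number of words, whereas the hand-written alternation needs one full copy of the sentence per gap.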

score 0 · Answer 4 · answered Jul 08 '19 at 21:10

You could do this:

import re

# Match the sentence with one or more spaces between each pair of words.
pattern = r"THE +TREE +IS +GREEN"

test_str = ("In this text,\n"
    "THE TREE IS GREEN should not match,\n"
    "THE  TREE IS GREEN should\n"
    "and so should THE TREE   IS GREEN\n"
    "but  double-spaced  TEXT  SHOULD  NOT BE  FLAGGED outside the pattern.")

# Report only matches that differ from the single-spaced sentence,
# i.e. those containing at least one run of two or more spaces.
for match in re.finditer(pattern, test_str):
    if match.group() != "THE TREE IS GREEN":
        print(match.group())

After posting this, I noticed that it is along the same lines as @karthick's answer above — SanV, Jul 08 '19 at 21:12
