regex only match content outside all kinds of brackets

Question

For the string `hello (hi that [is] so cool) awesome {yeah}`, I want the regex to match only `hello` and `awesome`.

This is what I have tried so far, and it seems not to work. https://regex101.com/r/NsUfQR/1

([^\[\]\{\}［］（）〔〕〈〉【】]+)(?![^()\[\]\{\}［］（）〔〕〈〉【】]*[\)\]）〕〉】］])

This matches hello hi that awesome yeah which is too many.

Is it possible to achieve this using only a regex, or is there another way using Perl or Python?

First come, first served? Nested too? Balanced text? Saying _outside_ needs a little more detail. — , Oct 08 '20 at 00:24
Yes, the brackets can nest inside each other. The aim is that no matter which brackets are outermost, nothing inside them should match. — jonah_w, Oct 08 '20 at 00:26
That's why I said first come, first served. I mean that if the inner text is _unbalanced_ it doesn't matter, e.g. `here (not[here]]or there) ok`. Are you using Perl or Python? — , Oct 08 '20 at 00:28

score 3 · Accepted Answer · 2020-10-08T15:53:20.857

This regex was written for the ordinary text brackets (), [], {}.
To add your own, copy one block, paste it, and change the delimiters to the
brackets you want. Pay attention to the recursion groups: each block must recurse into its own group number.
Add the new opening bracket to the exclusion class at the start as well.
Also note the fall-through [\S\s] at the end, which picks up any stray brackets.
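The copy, paste, and renumber recipe above can also be done mechanically. The following is a minimal sketch in Python that only assembles the pattern text; `build_pattern` is a hypothetical helper name, and the resulting string is meant for an engine that supports recursion and `(*SKIP)(*FAIL)` (Perl or PCRE), not for Python's built-in `re` module. It escapes every ASCII bracket, so the output differs cosmetically from a hand-written pattern.

```python
def build_pattern(pairs):
    """Assemble a PCRE-style pattern from a list of (opener, closer) pairs.

    Hypothetical helper: it only builds the pattern string, it does not run it.
    """
    def esc(ch):
        # Backslash-escape ASCII brackets; other brackets need no escaping.
        return "\\" + ch if ch in "()[]{}" else ch

    blocks = []
    for n, (opener, closer) in enumerate(pairs, start=1):
        o, c = esc(opener), esc(closer)
        # Block n is capture group n and recurses into itself via (?n).
        blocks.append(f"({o}(?>[^{o}{c}]++|(?{n}))*{c})")

    # Every opener goes into the leading exclusion class.
    openers = "".join(esc(opener) for opener, _ in pairs)
    return f"(?:[^{openers}]+|(?:{'|'.join(blocks)})(*SKIP)(*FAIL)|[\\S\\s])"
```

Because the wrappers are non-capturing, the position of a pair in the list is also its group number, which keeps each `(?n)` pointing at its own block.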

Update: added all your bracket types (from the comment).

/(?:[^\(\[{〈【〔（［]+|(?:(\((?>[^()]++|(?1))*\))|({(?>[^{}]++|(?2))*})|(\[(?>[^\[\]]++|(?3))*\])|(〈(?>[^〈〉]++|(?4))*〉)|(【(?>[^【】]++|(?5))*】)|(〔(?>[^〔〕]++|(?6))*〕)|(（(?>[^（）]++|(?7))*）)|(［(?>[^［］]++|(?8))*］))(*SKIP)(*FAIL)|[\S\s])/
https://regex101.com/r/LUXJVu/1

 (?:
    [^\(\[{〈【〔（［]+ 
  | 
    (?:
       (                   # (1 start), Left/Right parenthesis
          \(    
          (?>
             [^()]++ 
           | (?1) 
          )*
          \)                     
       )                   # (1 end)
     | 
       (                   # (2 start), Left/Right curly bracket
          {
          (?>
             [^{}]++ 
           | (?2) 
          )*
          }
       )                   # (2 end)
     | 
       (                   # (3 start), Left/Right square bracket
          \[ 
          (?>
             [^\[\]]++ 
           | (?3) 
          )*
          \] 
       )                   # (3 end)
     | 
       (                   # (4 start), Left/Right angle bracket
          〈
          (?>
             [^〈〉]++ 
           | (?4) 
          )*
          〉
       )                   # (4 end)
     | 
       (                   # (5 start), Left/Right black lenticular bracket
          【
          (?>
             [^【】]++ 
           | (?5) 
          )*
          】
       )                   # (5 end)
     | 
       (                   # (6 start), Left/Right tortoise bracket
          〔
          (?>
             [^〔〕]++ 
           | (?6) 
          )*
          〕
       )                   # (6 end)
     | 
       (                   # (7 start), Left/Right fullwidth parenthesis
          （
          (?>
             [^（）]++ 
           | (?7) 
          )*
          ）
       )                   # (7 end)
     | 
       (                   # (8 start), Left/Right fullwidth square bracket
          ［
          (?>
             [^［］]++ 
           | (?8) 
          )*
          ］
       )                   # (8 end)
    )
    (*SKIP) (*FAIL) 
  | 
    [\S\s] 
 )
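Since the question also allows Python, whose built-in `re` module has neither recursion nor `(*SKIP)(*FAIL)`, here is a rough sketch of the same "first come, first served" idea as a plain scanner rather than a regex. The names `PAIRS`, `find_close` and `outside_brackets` are made up for illustration. As in the regex, the opening bracket decides the type, and only that type is counted while looking for its partner. One deliberate difference: a stray bracket with no partner is kept inside the surrounding text instead of being reported as its own one-character match.

```python
PAIRS = {"(": ")", "[": "]", "{": "}", "〈": "〉",
         "【": "】", "〔": "〕", "（": "）", "［": "］"}

def find_close(text, start):
    """Return the index just past the bracket closing text[start], or -1."""
    opener = text[start]
    closer = PAIRS[opener]
    depth = 0
    for i in range(start, len(text)):
        if text[i] == opener:
            depth += 1
        elif text[i] == closer:
            depth -= 1
            if depth == 0:
                return i + 1
    return -1  # the opener is never closed

def outside_brackets(text):
    """Collect the runs of text that lie outside every balanced group."""
    pieces, buf = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PAIRS:
            end = find_close(text, i)
            if end != -1:
                if buf:                  # flush the text seen so far
                    pieces.append("".join(buf))
                    buf = []
                i = end                  # jump over the whole group
                continue
        buf.append(ch)                   # ordinary character or stray bracket
        i += 1
    if buf:
        pieces.append("".join(buf))
    return pieces
```

Calling `outside_brackets('hello (hi that [is] so cool) awesome {yeah}')` gives the two outside runs with their surrounding spaces; strip them if needed.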

After I added the rest of the brackets and tested in a Perl one-liner: `perl -E'$a="hi (hello) world"; @test = $a =~ /((?:[^\(\[\{〈【〔（［]+|((\((?>[^()]++|(?1))*\))|(\{(?>[^\{\}]++|(?2))*\})|(\[(?>[^\[\]]++|(?3))*\])|(〈(?>[^〈〉]++|(?4))*〉)|(【(?>[^【】]++|(?5))*】)|(〔(?>[^〔〕]++|(?6))*〕)|(（(?>[^（）]++|(?7))*）)|(［(?>[^［］]++|(?8))*］)|)(*SKIP)(*FAIL)|[\s\S]))/g; say for@test;'` it seems to print only `hi`, where it should print `hi` and `world`. Or am I writing something wrong? — jonah_w, Oct 08 '20 at 03:25
It works on https://regex101.com/r/oFFSoa/2 though, just not in the Perl one-liner. — jonah_w, Oct 08 '20 at 03:31
@jonah_w The capture groups in your regex just got a little out of shape is all. See update for the better regex. — , Oct 08 '20 at 15:37

zdim · Answer 2 · 2020-10-22T17:10:24.130

This gets into the thorny business of dealing with matching delimiters, possibly nested.

Instead of wrangling one grand regex, I'd suggest parsing the string for the text that lies outside all pairs of balanced (top-level) brackets, which is precisely what the question describes, using the core module Text::Balanced:

use warnings;
use strict;
use feature 'say';

use Text::Balanced qw(extract_bracketed);

my $string = 'hello (hi that [is] so cool) awesome {yeah}';

my @outside_of_brackets;

my ($match, $before);
my $remainder = $string;
while (1) {
    ($match, $remainder, $before) = extract_bracketed(
        $remainder, '(){}[]', '[^({[]*'
    );
    push @outside_of_brackets, $before // $remainder;
    last if not defined $match; 
}

say for @outside_of_brackets;

We ask for the first top-level balanced pair of any of the given brackets,^† and along with it we get what follows the pair ($remainder) and what came before it ($before).

It is $before that is needed here, so we keep parsing the $remainder the same way, collecting each $before, until there are no more matches. At that point the $remainder has no brackets left in it, so we take it as well ($before is not set then, hence the // fallback to $remainder in the code).

The code returns the expected strings, with some surrounding whitespace; trim as needed.

For another example, and for another approach using Regexp::Common, see this post.

^† extract_bracketed extracts the first top-level balanced pair of brackets, which by default must be found at the beginning of the string (after optional whitespace) or right after the end of its previous match. Alternatively, it may follow the pattern given as the third argument, which then must match (hence the * quantifier here, in case the brackets are at the very beginning).

So in this case it skips up to the first opening bracket and then parses the string to look for a balanced bracket pair. Types of brackets to seek are given as its second argument.
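For readers who prefer Python, the loop above can be imitated roughly as follows. This is only a sketch: `extract_first_balanced` is a hypothetical stand-in that borrows the (match, remainder, before) return shape described here, not a port of Text::Balanced, and it ignores everything else that module does. It assumes only the three ASCII bracket types and requires them to nest properly within each other.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def extract_first_balanced(text):
    """Return (match, remainder, before); match is None if nothing balanced is found.

    Hypothetical helper, loosely modelled on the three return values used above.
    """
    start = next((i for i, ch in enumerate(text) if ch in PAIRS), None)
    if start is None:
        return None, text, None
    expected = []                        # stack of closers still owed
    for i in range(start, len(text)):
        ch = text[i]
        if ch in PAIRS:
            expected.append(PAIRS[ch])
        elif ch in CLOSERS:
            if expected.pop() != ch:     # wrong closer: treat as a failure
                return None, text, None
            if not expected:             # the top-level pair just closed
                return text[start:i + 1], text[i + 1:], text[:start]
    return None, text, None              # ran out of text before balancing

def outside_of_brackets(text):
    """Collect the text before each top-level pair, plus the final remainder."""
    outside = []
    remainder = text
    while True:
        match, remainder, before = extract_first_balanced(remainder)
        outside.append(before if before is not None else remainder)
        if match is None:
            break
    return outside
```

The collected pieces keep their surrounding whitespace (and the last one may be empty), so a final pass such as `[s.strip() for s in outside_of_brackets(string) if s.strip()]` does the trimming.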

score 1 · Answer 3 · answered Oct 08 '20 at 00:50

my $string = 'hello (hi that [is] so cool) awesome {yeah} <and <then> some (even {more})>';
1 while $string =~ s/\([^(]*?\) *//;  #remove all ()
1 while $string =~ s/\[[^\[]*?\] *//; #remove all []
1 while $string =~ s/\{[^{]*?\} *//;  #remove all {}
1 while $string =~ s/<[^<]*?> *//;    #remove all <>
print "What is left now: $string\n";  #hello awesome

Or all-in-one:

1 while $string=~s/( \([^(]*?\) | \[[^[]*?\] | \{[^{]*?\} | <[^<]*?>  ) \s*//xg;
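The same repeated-removal idea can be sketched in Python with the standard `re` module. This is an illustrative variant, not a line-for-line translation: each alternative excludes both the opener and the closer of its own type, and whitespace is normalized in a separate step at the end. The names `INNERMOST` and `strip_bracketed` are made up for the example.

```python
import re

# One innermost pair of each type: no bracket of the same type inside it.
INNERMOST = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>")

def strip_bracketed(text):
    """Delete innermost bracket pairs repeatedly until none are left."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

def outside_text(text):
    """Remove all bracketed parts, then collapse runs of whitespace."""
    return " ".join(strip_bracketed(text).split())
```

Each pass removes the pairs that currently contain no bracket of their own type, so nested groups disappear from the inside out over successive passes.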
