
Given the string `hello (hi that [is] so cool) awesome {yeah}`, I want a regex that matches only `hello` and `awesome`, i.e. the text outside all brackets.

This is what I have tried so far, but it does not work: https://regex101.com/r/NsUfQR/1

([^\(\)\[\]\{\}[]()〔〕〈〉【】]+)(?![^()\[\]\{\}[]()〔〕〈〉【】]*[\)\])〕〉】]])

This matches `hello`, `hi that`, `awesome` and `yeah`, which is too much.

Is it possible to achieve this using only a regex, or is there another way using Perl or Python?

jonah_w
  • First come, first served? Nesting too? Balanced text? Just saying _outside_ leaves a few things open. –  Oct 08 '20 at 00:24
  • Yeah, the brackets could nest inside each other. The aim is that no matter which brackets are outermost, nothing within them should match. – jonah_w Oct 08 '20 at 00:26
  • That's why I said first come, first served. I mean that if the inner text is _unbalanced_ it doesn't matter, e.g. `here (not[here]]or there) ok`. Are you using Perl or Python? –  Oct 08 '20 at 00:28
  • I'm using Perl. Perl is preferable. :) – jonah_w Oct 08 '20 at 00:29

3 Answers


This regex was first written for the ordinary ASCII brackets (), [], {}.
To add your own, copy one of the blocks, paste it, and change the delimiters
to the brackets you want. Pay attention to the recursion groups: each block must recurse into its own group number.
Also add the new opening bracket to the exclusion list at the front.
Note the fall-through [\S\s] at the end, which picks up any stray characters.

Update: added all your bracket types (from the comment).

/(?:[^\(\[{〈【〔([]+|(?:(\((?>[^()]++|(?1))*\))|({(?>[^{}]++|(?2))*})|(\[(?>[^\[\]]++|(?3))*\])|(〈(?>[^〈〉]++|(?4))*〉)|(【(?>[^【】]++|(?5))*】)|(〔(?>[^〔〕]++|(?6))*〕)|(((?>[^()]++|(?7))*))|([(?>[^[]]++|(?8))*]))(*SKIP)(*FAIL)|[\S\s])/
https://regex101.com/r/LUXJVu/1

 (?:
    [^\(\[{〈【〔([]+ 
  | 
    (?:
       (                   # (1 start), Left/Right parenthesis
          \(    
          (?>
             [^()]++ 
           | (?1) 
          )*
          \)                     
       )                   # (1 end)
     | 
       (                   # (2 start), Left/Right curly bracket
          {
          (?>
             [^{}]++ 
           | (?2) 
          )*
          }
       )                   # (2 end)
     | 
       (                   # (3 start), Left/Right square bracket
          \[ 
          (?>
             [^\[\]]++ 
           | (?3) 
          )*
          \] 
       )                   # (3 end)
     | 
       (                   # (4 start), Left/Right angle bracket
          〈
          (?>
             [^〈〉]++ 
           | (?4) 
          )*
          〉
       )                   # (4 end)
     | 
       (                   # (5 start), Left/Right black lenticular bracket
          【
          (?>
             [^【】]++ 
           | (?5) 
          )*
          】
       )                   # (5 end)
     | 
       (                   # (6 start), Left/Right tortoise bracket
          〔
          (?>
             [^〔〕]++ 
           | (?6) 
          )*
          〕
       )                   # (6 end)
     | 
       (                   # (7 start), Left/Right fullwidth parenthesis
          (
          (?>
             [^()]++ 
           | (?7) 
          )*
          )
       )                   # (7 end)
     | 
       (                   # (8 start), Left/Right fullwidth square bracket
          [
          (?>
             [^[]]++ 
           | (?8) 
          )*
          ]
       )                   # (8 end)
    )
    (*SKIP) (*FAIL) 
  | 
    [\S\s] 
 )
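As a rough sketch of how the blocks above might be generated rather than copied by hand, the following builds the same kind of pattern from a list of bracket pairs and then collects the matches. The helper name outside_brackets_regex, the extra <> pair, and the sample string are assumptions for illustration, not part of the answer; only ASCII brackets are used to keep the escaping simple.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper (not from the answer): one recursive block per bracket pair.
sub outside_brackets_regex {
    my @pairs = @_;
    my $group = 0;
    my @blocks;
    for my $pair (@pairs) {
        my ($open, $close) = map { quotemeta($_) } @$pair;
        $group++;    # each block recurses into its own capture group
        push @blocks, "($open(?>[^$open$close]++|(?$group))*$close)";
    }
    # Exclusion list: every opening bracket ends a run of "outside" text.
    my $openers = join '', map { quotemeta($_->[0]) } @pairs;
    my $source  = "[^$openers]+|(?:" . join('|', @blocks)
                . ")(*SKIP)(*FAIL)|[\\S\\s]";
    return qr/$source/;
}

my $re = outside_brackets_regex( ['(', ')'], ['[', ']'], ['{', '}'], ['<', '>'] );

my $string = 'hello (hi that [is] so cool) awesome {yeah} <and <then> some>';

my @outside;
while ($string =~ /$re/g) {
    (my $text = $&) =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    push @outside, $text if length $text;
}

say for @outside;    # prints "hello" and "awesome" on separate lines
```

The matches are read from $& inside a while loop rather than taken from a list-context match, because the pattern contains capture groups and a list-context /g match would return those groups instead of the whole match. The (?N) numbers are absolute, so wrapping the pattern in an additional capture group, or embedding it in a larger regex that has groups of its own, shifts them.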
  • After I added the rest of the brackets and tested it in a Perl one-liner: `perl -E'$a="hi (hello) world"; @test = $a =~ /((?:[^\(\[\{〈【〔([]+|((\((?>[^()]++|(?1))*\))|(\{(?>[^\{\}]++|(?2))*\})|(\[(?>[^\[\]]++|(?3))*\])|(〈(?>[^〈〉]++|(?4))*〉)|(【(?>[^【】]++|(?5))*】)|(〔(?>[^〔〕]++|(?6))*〕)|(((?>[^()]++|(?7))*))|([(?>[^[]]++|(?8))*])|)(*SKIP)(*FAIL)|[\s\S]))/g; say for@test;'` It seems to print only `hi`; it should print `hi` and `world`. Or could it be that I'm writing something wrong? – jonah_w Oct 08 '20 at 03:25
  • It works on https://regex101.com/r/oFFSoa/2 though. It just doesn't work in the Perl one-liner. – jonah_w Oct 08 '20 at 03:31
  • @jonah_w The capture groups in your regex just got a little out of shape is all. See update for the better regex. –  Oct 08 '20 at 15:37

This gets into the thorny business of dealing with matching delimiters, possibly nested.

Instead of wrangling a grand regex, I'd suggest parsing the string for the text that lies outside all pairs of balanced (top-level) brackets, which is precisely what the question describes, using the core module Text::Balanced.

use warnings;
use strict;
use feature 'say';

use Text::Balanced qw(extract_bracketed);

my $string = 'hello (hi that [is] so cool) awesome {yeah}';

my @outside_of_brackets;

my ($match, $before);
my $remainder = $string;
while (1) {
    ($match, $remainder, $before) = extract_bracketed(
        $remainder, '(){}[]', '[^({[]*'
    );
    push @outside_of_brackets, $before // $remainder;
    last if not defined $match; 
}

say for @outside_of_brackets;

We ask for the first top-level balanced pair of any of the given brackets. Along with the match we get what follows the pair ($remainder) and what preceded it ($before).

It is $before that is needed here. We keep parsing $remainder the same way, collecting each $before, until there are no more matches. At that point $remainder has no brackets in it, so we take it as well; the failed call leaves $before undefined, which is why the code falls back to $remainder.

The code produces the expected strings, with some extra whitespace; trim as needed.

For another example, and for another approach using Regexp::Common, see this post.


extract_bracketed extracts the first top-level balanced pair of brackets, along with what is inside it. By default the pair must be found at the beginning of the string (after optional whitespace) or right after the end of the previous match. If a pattern is given as the third argument, the pair must instead follow text matching that pattern. The pattern has to match, hence the * quantifier here, which lets it match nothing when the brackets are at the very beginning.

So in this case it skips up to the first opening bracket and then parses the string to look for a balanced bracket pair. The types of brackets to look for are given as its second argument.
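To make the three return values concrete, here is a minimal sketch of a single call on a shortened sample string, followed by a call on a string without brackets. The sample strings and variable names are made up for illustration; the behaviour on failure is described as I understand the Text::Balanced documentation (in list context everything except the remainder comes back undefined).

```perl
use strict;
use warnings;
use feature 'say';

use Text::Balanced qw(extract_bracketed);

my $text = 'hello (hi [is] cool) awesome';

# List context returns (extracted, remainder, skipped prefix).
my ($match, $remainder, $before) =
    extract_bracketed($text, '(){}[]', '[^({[]*');

say "before:    '$before'";       # 'hello '
say "match:     '$match'";        # '(hi [is] cool)'
say "remainder: '$remainder'";    # ' awesome'

# No brackets at all: the call fails, and the remainder is the whole input.
my $plain = 'no brackets here';
my ($m2, $r2, $b2) = extract_bracketed($plain, '(){}[]', '[^({[]*');

say defined $m2 ? "matched '$m2'" : "no match; remainder is '$r2'";
```

The second call shows why the loop in the answer can use a failed match as its exit condition and still pick up the trailing text from $remainder.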

zdim
my $string = 'hello (hi that [is] so cool) awesome {yeah} <and <then> some (even {more})>';
1 while $string =~ s/\([^(]*?\) *//;  #remove all ()
1 while $string =~ s/\[[^\[]*?\] *//; #remove all []
1 while $string =~ s/\{[^{]*?\} *//;  #remove all {}
1 while $string =~ s/<[^<]*?> *//;    #remove all <>
print "What is left now: $string\n";  #hello awesome

Or all-in-one:

1 while $string=~s/( \([^(]*?\) | \[[^[]*?\] | \{[^{]*?\} | <[^<]*?>  ) \s*//xg;
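The substitutions above leave a single string. If the pieces are wanted as separate items, as in the question, one possible variation is to replace each bracketed span with a separator and split on it afterwards. This is only a sketch: the separator character "\x1F" is an assumption (it must not occur in the input), the variable names are invented here, and the /r flag needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use feature 'say';

my $string = 'hello (hi that [is] so cool) awesome {yeah}';

my $sep  = "\x1F";    # hypothetical separator, assumed absent from the input
my $work = $string;

# Each pass replaces one span that has no bracket of its own kind nested
# inside it; repeating until nothing matches peels nested spans away
# from the inside out.
1 while $work =~ s/ \( [^()]* \) | \[ [^\[\]]* \] | \{ [^{}]* \} /$sep/x;

# Split on runs of the separator, trim each piece, drop empty ones.
my @outside = grep { length }
              map  { s/^\s+|\s+$//gr }
              split /\x1F+/, $work;

say for @outside;    # prints "hello" and "awesome" on separate lines
```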
Kjetil S.