3

Is there a regular expression that can be used with search/replace to delete everything occurring within square brackets (and the brackets)?

I've tried \[.*\] which chomps extra stuff (e.g. "[chomps] extra [stuff]")

Also, the same thing with lazy matching \[.*?\] doesn't work when there is a nested bracket (e.g. "stops [chomping [too] early]!")

xanatos
ajwood
  • 1
    What language are you using regexes in? In general, your problem (recursive matching) can be solved only in some languages (.NET, Perl, and a few others; not JS, not Java) – xanatos Mar 23 '11 at 19:38
  • Here is an example that uses < and >: http://compgroups.net/comp.lang.perl.misc/FAQ-6.12-Can-I-use-Perl-regular-expressions-to-match-balanced-text,3 You can easily convert it to match [ and ]. I don't do Perl, so I won't write an answer. Another link: http://stackoverflow.com/questions/4445674/can-i-use-perl-regular-expressions-to-match-balanced-text – xanatos Mar 23 '11 at 19:45
  • The best I can think of is sticking a regex in a while loop and deleting square-bracketed stuff (with no nesting) until there are no square brackets left. Can Perl do better? – ajwood Mar 23 '11 at 19:46
  • Yes, see my answer and Tim's. – Bart Kiers Mar 23 '11 at 20:06

5 Answers

11

Try something like this:

$text = "stop [chomping [too] early] here!";
$text =~ s/\[([^\[\]]|(?0))*]//g;
print($text);

which will print:

stop  here!

A short explanation:

\[            # match '['
(             # start group 1
  [^\[\]]     #   match any char except '[' and ']'
  |           #   OR
  (?0)        #   recursively match group 0 (the entire pattern!)
)*            # end group 1 and repeat it zero or more times
]             # match ']'

The regex above will get replaced with an empty string.

You can test it online: http://ideone.com/tps8t

EDIT

As @ridgerunner mentioned, you can make the regex more efficient by letting the character class [^\[\]] match one or more times possessively, making the * possessive as well, and turning group 1 into a non-capturing group:

\[(?:[^\[\]]++|(?0))*+]

But a real improvement in speed might only be noticeable when working with large strings (you can test it, of course!).
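As a rough sketch, the possessive variant can be wrapped in a small helper and tried on the strings from the question. The sub name strip_bracketed is made up for this illustration, and the code assumes Perl 5.10 or later, which is where recursion and possessive quantifiers became available:

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this example): delete every
# balanced [...] group, including nested ones.
sub strip_bracketed {
    my ($text) = @_;
    # (?0) recurses into the whole pattern; ++ and *+ are possessive,
    # so the engine never backtracks into text it has already consumed.
    $text =~ s/\[(?:[^\[\]]++|(?0))*+\]//g;
    return $text;
}

print strip_bracketed("stop [chomping [too] early] here!"), "\n";  # stop  here!
print strip_bracketed("[chomps] extra [stuff]"), "\n";             # " extra "
print strip_bracketed("unbalanced [a [b] stays"), "\n";            # unbalanced [a  stays
```

Note the last case: an opening bracket with no matching close is left in place, and only the balanced inner group is removed.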

Bart Kiers
  • 2
    +1 And this expression can be made more efficient by adding a possessive plus to the first alternative: `\[([^\[\]]++|(?0))*]` – ridgerunner Mar 24 '11 at 01:38
5

This is technically not possible with regular expressions because the language you're matching does not meet the definition of "regular". There are some extended regex implementations that can do it anyway using recursive expressions, among them are:

Greta:

http://easyethical.org/opensource/spider/regexp%20c++/greta2.htm#_Toc39890907

and

PCRE

http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions

See "Recursive Patterns", which has an example for parentheses.

A PCRE recursive bracket match would look like this:

\[(?:[^\[\]]|(?R))*\]

edit:

Since you added that you're using Perl, here's a page that explicitly describes how to match balanced pairs of operators in Perl:

http://perldoc.perl.org/perlfaq6.html#Can-I-use-Perl-regular-expressions-to-match-balanced-text%3f

Something like:

$string =~ m/(\[(?:[^\[\]]++|(?1))*\])/xg;
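If you want to collect the balanced groups rather than delete them, the same pattern can be used in list context. This is only a sketch: the sample string and the variable names are invented, and it assumes Perl 5.10 or later for (?1) and the possessive ++:

```perl
use strict;
use warnings;

my $string = "a [b [c] d] e [f] g";

# Group 1 captures one whole balanced group; (?1) recurses into group 1
# whenever a nested '[' is encountered. In list context with /g, the
# match returns every value captured by group 1.
my @groups = $string =~ /(\[(?:[^\[\]]++|(?1))*\])/g;

print "$_\n" for @groups;
# [b [c] d]
# [f]
```

Only the outermost groups are returned, because each match consumes its nested groups.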
Tim Sylvester
4

Since you're using Perl, you can use modules from the CPAN and not have to write your own regular expressions. Check out the Text::Balanced module that allows you to extract text from balanced delimiters. Using this module means that if your delimiters suddenly change to {}, you don't have to figure out how to modify a hairy regular expression, you only have to change the delimiter parameter in one function call.
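A minimal sketch of how that could look with extract_bracketed from Text::Balanced. The helper strip_balanced is hypothetical (it is not part of the module), and the loop around the call is just one way to apply the function to every group in a string:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: drop every balanced [...] group from a string.
sub strip_balanced {
    my ($text) = @_;
    my $out = '';
    while (length $text) {
        # Copy everything before the next '[' unchanged.
        if ($text =~ s/\A([^\[]+)//) {
            $out .= $1;
            next;
        }
        # $text now starts with '['; try to pull off one balanced group.
        # In list context the input is not modified, so use the remainder.
        my ($group, $rest) = extract_bracketed($text, '[]');
        if (defined $group) {
            $text = $rest;                    # discard the balanced group
        }
        else {
            $out .= substr($text, 0, 1, '');  # unbalanced '[': keep it
        }
    }
    return $out;
}

print strip_balanced("stops [chomping [too] early]!"), "\n";   # stops !
```

The delimiter is the second argument to extract_bracketed, so switching to {} would mean changing that argument (and, in this sketch, the regex that skips ahead to the next opening delimiter).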

CanSpice
3

If you only need to delete the bracketed contents, and not capture them for use elsewhere, you can repeatedly remove the innermost groups, working from the inside of the nesting to the outside.

my $string = "stops [chomping [too] early]!";
# remove any [...] sequence that doesn't contain a [...] inside it
# and keep doing it until there are no [...] sequences to remove
1 while $string =~ s/\[[^\[\]]*\]//g; 
print $string;

The 1 while idiom has an empty loop body (the constant 1); all the work happens in the condition. Each time the s/// matches and removes at least one bracketed section, the loop repeats and the s/// runs again, stopping once a pass removes nothing.

This will work even if you're using an older version of Perl, or another language, that doesn't support the (?0) recursive pattern used in Bart Kiers's answer.

Ven'Tatsu
1

You want to remove only things between the []s that aren't []s themselves, i.e.:

\[[^\]]*\]

Which is a pretty hairy mess of []s ;-)

It won't handle nested []s, though; e.g., matching [foo[bar]baz] won't work.
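A quick check of both cases, as a sketch with made-up sample strings:

```perl
use strict;
use warnings;

# Flat (non-nested) brackets are removed cleanly.
(my $flat = "keep [drop] keep [drop]") =~ s/\[[^\]]*\]//g;
print "$flat\n";     # "keep  keep "

# With nesting, the match stops at the first ']', leaving a stray tail.
(my $nested = "[foo[bar]baz]") =~ s/\[[^\]]*\]//g;
print "$nested\n";   # baz]
```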

Wes Hardaker