15

Suppose I have some text like this,

text<-c("[McCain]: We need tax policies that respect the wage earners and job creators. [Obama]: It's harder to save. It's harder to retire. [McCain]: The biggest problem with American healthcare system is that it costs too much. [Obama]: We will have a healthcare system, not a disease-care system. We have the chance to solve problems that we've been talking about... [Text on screen]: Senators McCain and Obama are talking about your healthcare and financial security. We need more than talk. [Obama]: ...year after year after year after year. [Announcer]: Call and make sure their talk turns into real solutions. AARP is responsible for the content of this advertising.")

and I would like to remove (i.e., get rid of) all of the text between the [ and ] (and the brackets themselves). What's the best way to do this? Here is my feeble attempt using regex and the stringr package:

library(stringr)
str_extract(text, "\\[[a-z]*\\]")  # doesn't work: [a-z] misses the capitalised names, and str_extract() returns only the first match

Thanks for any help!

Michael Davidson

5 Answers

29

You can strip the bracketed tags (and the brackets themselves) with this:

gsub("\\[[^\\]]*\\]", "", text, perl=TRUE)

What the regex means:

  \[                       # '['
  [^\]]*                   # any character except ']' (0 or more times,
                           # matching as much as possible)
  \]                       # ']'
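For example, on a short made-up snippet (a minimal sketch; the full text vector from the question behaves the same way):

snippet <- "[McCain]: We need tax policies. [Obama]: It's harder to save."
gsub("\\[[^\\]]*\\]", "", snippet, perl=TRUE)
## => [1] ": We need tax policies. : It's harder to save."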
zx81
  • Thanks! Appreciate the regex explanation. – Michael Davidson May 31 '14 at 05:29
  • @MichaelDavidson You're very welcome. FYI, in general a negated character class like the one here will be faster than a lazy dot-star such as `.*?`, because with the lazy quantifier the engine backtracks at each step. Not a big deal in this case though; either solution is fine. :) – zx81 May 31 '14 at 05:32
  • @jbaums Thanks for your prodding on this, mate; I'll make it up to you. :) – zx81 May 31 '14 at 05:42
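To check the timing claim from the comment above, here is a rough benchmark sketch (assuming the microbenchmark package is installed; on a string this short any difference is negligible either way):

library(microbenchmark)
microbenchmark(
  negated = gsub("\\[[^\\]]*\\]", "", text, perl = TRUE),  # negated character class
  lazy    = gsub("\\[.*?\\]",     "", text, perl = TRUE),  # lazy dot-star
  times   = 1000
)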
11

The following should do the trick. The ? makes the match lazy, so .* consumes as few characters as possible before the next ].

gsub('\\[.*?\\]', '', text)
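To see why the ? matters, compare the greedy and lazy versions on a short made-up string (a minimal sketch):

x <- '[McCain]: tax policy. [Obama]: healthcare.'
gsub('\\[.*\\]', '', x)   # greedy: .* runs all the way to the last ']'
## => [1] ": healthcare."
gsub('\\[.*?\\]', '', x)  # lazy: stops at the first ']'
## => [1] ": tax policy. : healthcare."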
jbaums
4

Here's another approach:

library(qdap)
bracketX(text, "square")
Tyler Rinker
4

I think this technically answers what you've asked, but you probably want to add \\: plus a trailing space to the end of the regex for prettier text (removing the colon and space that follow each bracketed tag).

library(stringr)
str_replace_all(text, "\\[.+?\\]", "")

#> [1] ": We need tax policies that respect the wage earners..."

vs...

str_replace_all(text, "\\[.+?\\]\\: ", "")
#> [1] "We need tax policies that respect the wage earners..." 

Created on 2018-08-16 by the reprex package (v0.2.0).

Nettle
3

No need to use a PCRE regex with a negated character class / bracket expression; a "classic" TRE regex will work, too:

subject <- "Some [string] here and [there]"
gsub("\\[[^][]*]", "", subject)
## => [1] "Some  here and "

Details:

  • \\[ - a literal [ (must be escaped or used inside a bracket expression like [[] to be parsed as a literal [)
  • [^][]* - a negated bracket expression that matches 0+ chars other than [ and ] (note that the ] at the start of the bracket expression is treated as a literal ])
  • ] - a literal ] (this character is not special in either PCRE or TRE regexps and does not have to be escaped).
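As the first bullet notes, the opening bracket can also be matched with a bracket expression instead of a backslash escape; this is an equivalent sketch of the same pattern:

gsub("[[][^][]*]", "", subject)
## => [1] "Some  here and "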

If you want to only replace the square brackets with some other delimiters, use a capturing group with a backreference in the replacement pattern:

gsub("\\[([^][]*)\\]", "{\\1}", subject)
## => [1] "Some {string} here and {there}"

The (...) parenthetical construct forms a capturing group, and its contents can be accessed with a backreference \1 (as the group is the first one in the pattern, its ID is set to 1).
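Applied to the question's text, the same idea keeps the speaker names while dropping the brackets (a sketch; output abbreviated here):

gsub("\\[([^][]*)\\]", "\\1", text)
## => [1] "McCain: We need tax policies that respect the wage earners and job creators. Obama: It's harder to save. ..."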

Wiktor Stribiżew