
Let's say I have a string:

x <- "This is a string (Yay, string!)" 

I'd like to parse the string and return "Yay, string!"

How do I do that?

I tried a bunch of grep/grepl/gsub/sub/etc. calls but couldn't find the right combination of regex and arguments. Sigh. I need to work on my regex skills.

Brandon Bertelsen
  • possible duplicate of [Extract info inside all parenthesis in R (regex)](http://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r-regex) – Tyler Rinker Sep 10 '12 at 21:58
  • Definitely a dupe, but the answers seem different. – Brandon Bertelsen Sep 11 '12 at 06:26
  • `strapplyc` in the gsubfn package handles problems like that. The regular expression in the following code matches `(` followed by any number of characters that are not `)` and returns the part within parentheses: `library(gsubfn); strapplyc(x, "\\(([^)]*)", simplify = TRUE)` By default it uses tcl regular expressions which are quite fast, e.g. check the examples in `?strapplyc` for the one which parses the entire text of James Joyce's Ulysses in seconds. Regarding learning about regex's, there are links to regex resources on the gsubfn home page http://gsubfn.googlecode.com . – G. Grothendieck Sep 11 '12 at 10:36

3 Answers


Here are two ways of doing it:

One: Match the whole string, capture the part you want, and replace the entire match with the captured part (this is done with a backreference):

gsub(".*\\((.*)\\).*", "\\1", x)
[1] "Yay, string!"

This works because:

  • The backreference \\1 refers to the text captured by the group (.*)
  • To match the literal parentheses in the string, rather than start a capture group, you need to escape them as \\( and \\).
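
As a rough illustration of the two points above (a sketch, not part of the original answer): the `[^()]*` group here is my own substitution for `(.*)`, on the assumption that the wanted text contains no nested parentheses, and `sub()` is used because only one replacement is needed.

```r
x <- "This is a string (Yay, string!)"

# Escaped \\( and \\) match literal parentheses; the unescaped ( ) form
# capture group 1, which the replacement "\\1" refers back to.
inner <- sub(".*\\(([^()]*)\\).*", "\\1", x)
print(inner)

# If the pattern does not match at all, sub() returns the input unchanged,
# so it may be worth checking with grepl() first.
no_match <- sub(".*\\(([^()]*)\\).*", "\\1", "no parentheses here")
print(no_match)
```

The second call shows a caveat of this approach: a string without parentheses comes back as-is rather than as an empty result.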

Two: Replace all the bits you don't want with empty strings:

gsub(".*\\(|\\).*", "", x)
[1] "Yay, string!"

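This works because | is the alternation (OR) operator: the pattern matches either everything up to and including the opening parenthesis, or the closing parenthesis and everything after it, and gsub replaces each match with an empty string.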
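
To see what each side of the `|` contributes, the two alternatives can be applied one at a time. This is only a sketch for illustration; the intermediate names `step1` and `step2` are mine.

```r
x <- "This is a string (Yay, string!)"

# Alternative 1: everything up to and including "(".
step1 <- sub(".*\\(", "", x)
print(step1)

# Alternative 2: ")" and everything after it.
step2 <- sub("\\).*", "", step1)
print(step2)

# Both alternatives in a single pattern; gsub() replaces every match.
combined <- gsub(".*\\(|\\).*", "", x)
print(combined)
```

Note that `gsub()` rather than `sub()` is needed for the combined pattern, since both alternatives have to be replaced in the same string.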

Andrie
  • Is there some documentation for the `\\1` usage? I read about it at the bottom of ?grep but didn't understand it. – Brandon Bertelsen Sep 10 '12 at 21:32
  • @BrandonBertelsen I've expanded the answer slightly, but in general my advice is to learn regex from anywhere but the R docs. For example, here is a [tutorial on backreferences](http://www.regular-expressions.info/brackets.html) – Andrie Sep 10 '12 at 21:36
  • @BrandonBertelsen I've also added the opposite approach, i.e. replacing the bits you don't want with empty strings. – Andrie Sep 10 '12 at 21:40
  • Appreciate the extra attention here and the link to the tutorial. – Brandon Bertelsen Sep 11 '12 at 06:29

Also, if some of your strings might contain several parenthesized substrings, all of which you want to extract, use the regex power-tools gregexpr() and regmatches():

x <- "This is (a) string (Yay, string!)" 
pat <- "(?<=\\()([^()]*)(?=\\))"
regmatches(x, gregexpr(pat, x, perl=TRUE))
# [[1]]
# [1] "a"            "Yay, string!"
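
If the lookbehind/lookahead syntax is hard to read, a similar result can probably be had without `perl=TRUE` by matching the parentheses themselves and trimming them afterwards. This is a sketch under the assumption that parentheses are not nested; the names `with_parens` and `inner` are mine.

```r
x <- "This is (a) string (Yay, string!)"

# Match each "(...)" chunk, parentheses included.
with_parens <- regmatches(x, gregexpr("\\([^()]*\\)", x))[[1]]

# Drop the first and last character of every match.
inner <- substr(with_parens, 2, nchar(with_parens) - 1)
print(inner)
```

The lookaround version above does the same thing in one step: `(?<=\\()` and `(?=\\))` require the parentheses to be present but leave them out of the match.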
Josh O'Brien
  • And just like that, I now see that my **answer** is an exact duplicate of the one referenced a moment ago by Tyler Rinker above! – Josh O'Brien Sep 10 '12 at 22:04
  • That was the basis of the `bracketX` and `bracketXtract` functions in `qdap`. I asked a question that is similar and provides even more detail here: http://stackoverflow.com/questions/8621066/remove-text-inside-brackets-parens-and-or-braces – Tyler Rinker Sep 10 '12 at 22:35
  • I saw that question, but it was far too complex for my limited understanding of regex. – Brandon Bertelsen Sep 11 '12 at 06:28
  • You mean you don't speak `"(?<=\\()([^()]*)(?=\\))"`? ;-) – Josh O'Brien Sep 11 '12 at 18:03

qdap version 1.1.0 can do this:

library(qdap)
x <- "This is a string (Yay, string!)" 

bracketX(x)
bracketXtract(x)

Yields:

> bracketX(x)
[1] "This is a string"
> bracketXtract(x)
[1] "Yay, string!"

Though if you're not doing much of this stuff then getting qdap may be a bit of overkill.

Edit: With Josh's example...

> x <- "This is (a) string (Yay, string!)" 
> bracketX(x)
[1] "This is string"
> bracketXtract(x)
[1] "a"            "Yay, string!"
Tyler Rinker