Extract text in parentheses in R

Question

Two related questions. I have vectors of text data such as

"a(b)jk(p)"  "ipq"  "e(ijkl)"

and want to easily separate it into a vector containing the text OUTSIDE the parentheses:

"ajk"  "ipq"  "e"

and a vector containing the text INSIDE the parentheses:

"bp"   ""  "ijkl"

Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.

This post might be useful: http://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r-regex — , Mar 10 '15 at 03:26

Avinash Raj · Accepted Answer · 2015-03-10T04:18:41.040

15

Text outside the parentheses

> x <- c("a(b)jk(p)"  ,"ipq" , "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"

Text inside the parentheses

> x <- c("a(b)jk(p)"  ,"ipq" , "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

The (?<=\()[^()]*(?=\)) part matches the characters inside each pair of parentheses, and the (*SKIP)(*F) that follows makes that match fail while stopping the regex engine from retrying those characters. The engine then tries the alternative after the | symbol against the rest of the string, so the dot . matches only the characters that were not skipped, that is, the parentheses themselves and everything outside them. Replacing all the matched characters with an empty string leaves only the text inside the parentheses.

> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

This regex captures the characters inside each pair of parentheses in group 1, while the alternative after the | (a single dot) matches every remaining character without capturing it. Replacing each match with the contents of group 1 (\\1) therefore keeps the captured text and drops everything else, which gives the desired output.
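Both pieces can be wrapped into one small helper. This is only a sketch: the name split_parens and the shape of its return value are made up for illustration, and like the patterns above it assumes the parentheses are not nested.

```r
# Hypothetical helper (not from any package): returns the text outside
# and inside the parentheses for each element of a character vector.
# Assumes parentheses are balanced and not nested.
split_parens <- function(x) {
  # Drop every "(...)" group to keep the outside text
  outside <- gsub("\\([^()]*\\)", "", x)
  # Keep only capture group 1 (the inside text); "." consumes the rest
  inside <- gsub("\\(([^()]*)\\)|.", "\\1", x, perl = TRUE)
  list(outside = outside, inside = inside)
}

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
res <- split_parens(x)
res$outside
res$inside
```

With the x from the question, res$outside and res$inside should match the two outputs shown above.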

edited Mar 10 '15 at 04:18

answered Mar 10 '15 at 03:50

Avinash Raj

I would assume it accidental as a downvote makes no sense. The extraction was cool because you extracted and put together without `paste`ing. +1 – Tyler Rinker Mar 10 '15 at 12:56
@TylerRinker yep, someone got angry with me and so he put 4 downvotes on my answer which has a min score of 1. My bad. – Avinash Raj Mar 10 '15 at 12:59
@TylerRinker could you provide the link to qdapRegex package? – Avinash Raj Mar 10 '15 at 13:00
To the first comment "petty" :-( To the second...Sure https://github.com/trinker/qdapRegex I linked in my answer too. It's a CRAN package as well. – Tyler Rinker Mar 10 '15 at 13:27
@TylerRinker qdapRegex was definitely a well put together package. – hwnd Mar 10 '15 at 22:53

Tyler Rinker · Answer 2 · 2015-03-10T13:25:57.910

7

The rm_round function in the qdapRegex package I maintain was born to do this:

First we'll get and load the package via pacman

if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)

## Then we can use it to remove and extract the parts you want:

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

rm_round(x)

## [1] "ajk" "ipq" "e" 

rm_round(x, extract=TRUE)

## [[1]]
## [1] "b" "p"
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] "ijkl"

To condense b and p use:

sapply(rm_round(x, extract=TRUE), paste, collapse="")

## [1] "bp"   "NA"   "ijkl"
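Note that collapsing turns the NA for "ipq" into the string "NA". If an empty string is wanted there instead, as in the question, a base R sketch of the same extract-then-collapse idea could look like the following. It does not use qdapRegex, and the variable names are only illustrative.

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# One character vector per element of x; character(0) when the
# element has nothing in parentheses
parts <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

# Collapsing an empty vector yields "" rather than "NA"
inside <- vapply(parts, paste, character(1), collapse = "")
inside
```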

edited Mar 10 '15 at 13:25

answered Mar 10 '15 at 04:44

Tyler Rinker

`regmatches(x,gregexpr("(?<=\\().+?(?=\\))",x,perl=TRUE))` for a `regmatches` version in base, with `regmatches(x,gregexpr("(?<=\\)|^).+?(?=\\(|$)",x,perl=TRUE))` for the reverse. – thelatemail Mar 10 '15 at 04:46
@thelatemail That deserves its own answer. – Tyler Rinker Mar 10 '15 at 04:48