Get the content from the first parenthesis in a string

Question

I have a text with several parentheses and I would like to extract the text from the first pair. For example, in the string below I would like to get "int1":

string <- "string1(int1)string2(int2)string3(int3)"

I know nothing about regular expressions, and my problem is that I don't know how to stop at the first "(" and ")". In the examples below, when I match the character literally, the match stops at its first occurrence in the string (of course, using sub and not gsub). But when I put ".*" before my character, it matches the last occurrence of it in the string.

sub("\\(", "X", string, perl = TRUE)
#[1] "string1Xint1)string2(int2)string3(int3)"
sub(".*\\(", "X", string, perl = TRUE)
#[1] "Xint3)"
sub(".*\\)", "X", string, perl = TRUE)
#[1] "X"
sub("\\)", "X", string, perl = TRUE)
#[1] "string1(int1Xstring2(int2)string3(int3)"

So when I do something like sub(".*\\((.*)\\).*", "\\1", string, perl = TRUE), I get the string in the last pair of parentheses.

My first question is: how can I stop at the first "(" and ")", as in sub("\\)", ...)?

After many tries I found a way to extract the string from the first pair of parentheses (which I'm not sure I understand, because of the grouping part with ()):

string %>%
  sub("(\\).*$)", "\\2", ., perl = TRUE) %>% #[1] "string1(int1"
  sub(".*\\(", "", ., perl = TRUE)
#[1] "int1"

Can you suggest a better solution?

Also, do you know where I can find an understandable document about R and Perl regular expressions? I learned some basics from https://www.cs.tut.fi/~jkorpela/perl/regexp.html and I'm looking for more examples.

Thank You.

Maybe using `?` can solve this: `sub(".*?\\((.*?)\\).*", "\\1", string, perl = TRUE)`. Check here: https://regex101.com/r/bV7oH6/1 — Wiktor Stribiżew, Apr 22 '15 at 13:19
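To make the greedy versus lazy difference in this comment concrete, here is a minimal sketch using the string from the question. The variable names `greedy` and `lazy` are illustrative only and do not come from the original post.

```r
string <- "string1(int1)string2(int2)string3(int3)"

# Greedy: the leading .* first consumes the whole string, then backtracks
# only as far as needed, so it ends up just before the LAST "(".
greedy <- sub(".*\\((.*)\\).*", "\\1", string, perl = TRUE)

# Lazy: .*? consumes as little as possible, so it stops before the FIRST "(",
# and the captured .*? stops before the first ")" that follows.
lazy <- sub(".*?\\((.*?)\\).*", "\\1", string, perl = TRUE)

print(greedy)  # "int3"
print(lazy)    # "int1"
```

In both calls the pattern spans the whole string, so sub replaces everything with the captured group `\\1`.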

Avinash Raj · Accepted Answer · 2015-04-22T13:30:45.070

You could use the regmatches function together with regexpr; regexpr returns only the first match.

> string <- "string1(int1)string2(int2)string3(int3)"
> regmatches(string, regexpr("(?<=\\()[^()]*(?=\\))", string, perl=TRUE))
[1] "int1"

OR

> regmatches(string, regexpr("(?<=\\().*?(?=\\))", string, perl=TRUE))
[1] "int1"

OR

> gsub("\\).*|^[^()]*\\(", "", string)
[1] "int1"
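One practical caveat when applying the regmatches approach to a vector: regmatches drops elements that have no match, so the result can be shorter than the input. The following is a minimal sketch, assuming you want NA for strings without a pair of parentheses. The helper name `first_paren` is hypothetical and not part of the original answer.

```r
# Hypothetical helper: content of the first "(...)" in each element of x,
# or NA where there is no match.
first_paren <- function(x) {
  m <- regexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE)
  hit <- !is.na(m) & m != -1          # regexpr returns -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)        # regmatches returns only the matched elements
  out
}

res <- first_paren(c("string1(int1)string2(int2)", "no parentheses"))
print(res)  # "int1" NA
```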

edited Apr 22 '15 at 13:30

answered Apr 22 '15 at 13:21

Avinash Raj


`regmatches()`+`regexpr()` is the right approach, I would just suggest that you replace the greedy `[^()]*` with a non-greedy `.*?`. – bgoldst Apr 22 '15 at 13:25

@bgoldst but your approach will return the wrong string if the input is `foo(bar(buz)`. If the OP wants `bar(buz`, then he must follow yours; if he wants `buz`, then he must follow mine. Anyway, I added your suggestion. – Avinash Raj Apr 22 '15 at 13:30
Good point. But I suspect he wouldn't want `buz` either, in that case. How about this: `str <- 'string1(bar(buz)foo)string2(int2)string3(int3)'; regmatches(str,regexpr('(?<=\\().*?(\\(.*?\\).*?)*(?=\\))',str,perl=T));`. – bgoldst Apr 22 '15 at 13:58
@Avinash - Can you please explain your solution in a simpler way, regmatches(string, regexpr("(?<=\\()[^()]*(?=\\))", string, perl=TRUE)) [1] "int1" – Sadhun Apr 24 '15 at 14:21
`(?<=\\()` is a lookbehind that matches the position just after a `(` character. `.*?` matches as few characters as possible, until the lookahead `(?=\\))` finds the first closing bracket `)`. So it matches all the characters between the parentheses. `regexpr` does only a single match, i.e. the first match; `gregexpr` does a global match. Here, this returns the characters between the first pair of parentheses. – Avinash Raj Apr 24 '15 at 15:18
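As a rough illustration of the points raised in these comments, the sketch below compares `regexpr` with `gregexpr` on the question's string, and then compares the two patterns on the unbalanced input `foo(bar(buz)`. The variable names are illustrative assumptions, not taken from the thread.

```r
string  <- "string1(int1)string2(int2)string3(int3)"
pattern <- "(?<=\\()[^()]*(?=\\))"

# regexpr(): first match only
first <- regmatches(string, regexpr(pattern, string, perl = TRUE))

# gregexpr(): every match; regmatches() then returns a list with one
# element per input string, hence the [[1]]
all_matches <- regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]

print(first)        # "int1"
print(all_matches)  # "int1" "int2" "int3"

# Unbalanced input: the two patterns disagree
nested  <- "foo(bar(buz)"
negated <- regmatches(nested, regexpr("(?<=\\()[^()]*(?=\\))", nested, perl = TRUE))
lazy    <- regmatches(nested, regexpr("(?<=\\().*?(?=\\))", nested, perl = TRUE))

print(negated)  # "buz"     ([^()]* cannot cross the inner "(")
print(lazy)     # "bar(buz" (.*? starts after the first "(" and may cross it)
```

Which result counts as correct depends on what the input is expected to look like, which is the trade-off the comments describe.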
