I have a text with several parentheses and I would like to extract the text from the first pair, e.g. in the string below I would like to get "int1":

string <- "string1(int1)string2(int2)string3(int3)"

I know nothing about regular expressions, and my problem is that I don't know how to stop at the first "(" and ")". In the examples below, when I match the character literally, the match stops at its first occurrence in the string (using sub, of course, not gsub). But when I put ".*" before the character, it matches the last occurrence in the string.

sub("\\(", "X", string, perl = TRUE)
#[1] "string1Xint1)string2(int2)string3(int3)"
sub(".*\\(", "X", string, perl = TRUE)
#[1] "Xint3)"
sub(".*\\)", "X", string, perl = TRUE)
#[1] "X"
sub("\\)", "X", string, perl = TRUE)
#[1] "string1(int1Xstring2(int2)string3(int3)"

So when I do something like sub(".*\\((.*)\\).*", "\\1", string, perl = TRUE) I get the string in the last pair of parentheses.

My first question is: how can I stop at the first "(" and ")", as in sub("\\)", ...)?

After many tries I found a way to extract the string from the first pair of parentheses (which I'm not sure I understand, because of the grouping part with ()):

string %>%
  sub("(\\).*$)", "\\2", ., perl = TRUE) %>% #[1] "string1(int1"
  sub(".*\\(", "", ., perl = TRUE)
#[1] "int1"

Can you suggest a better solution?

And do you know where I can find an accessible document about R and Perl regular expressions? I learned some basics from https://www.cs.tut.fi/~jkorpela/perl/regexp.html and I'm looking for more examples.

Thank You.

Julien Navarre

1 Answer


You could use the regmatches function together with regexpr; regexpr returns only a single match, the very first one.

> string <- "string1(int1)string2(int2)string3(int3)"
> regmatches(string, regexpr("(?<=\\()[^()]*(?=\\))", string, perl=TRUE))
[1] "int1"

OR

> regmatches(string, regexpr("(?<=\\().*?(?=\\))", string, perl=TRUE))
[1] "int1"

OR

> gsub("\\).*|^[^()]*\\(", "", string)
[1] "int1"
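
To address the first question directly: the `.*` in the question's attempt is greedy, so it consumes as much as it can and only backs off to the last "(". A minimal sketch of two single-call `sub` alternatives follows, assuming the input always contains at least one "(...)" pair; the variable names `greedy`, `lazy` and `negated` are only illustrative.

```r
string <- "string1(int1)string2(int2)string3(int3)"

# Greedy: the leading .* runs to the end of the string and backtracks
# only as far as the LAST "(", so the last group is captured.
greedy <- sub(".*\\((.*)\\).*", "\\1", string, perl = TRUE)

# Lazy: .*? takes as few characters as possible, so the match stops at
# the FIRST "(" and the capture stops at the first ")".
lazy <- sub(".*?\\((.*?)\\).*", "\\1", string, perl = TRUE)

# Negated character classes: [^(]* cannot cross a "(" and [^)]* cannot
# cross a ")", so no lazy quantifier (and no perl = TRUE) is needed.
negated <- sub("^[^(]*\\(([^)]*)\\).*$", "\\1", string)

print(greedy)   # "int3"
print(lazy)     # "int1"
print(negated)  # "int1"
```

Note that `sub` returns the input unchanged when the pattern does not match, so these variants are only safe when a parenthesised group is known to be present.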
Avinash Raj
  • `regmatches()`+`regexpr()` is the right approach, I would just suggest that you replace the greedy `[^()]*` with a non-greedy `.*?`. – bgoldst Apr 22 '15 at 13:25
  • @bgoldst but your approach will return a different string if the input is `foo(bar(buz)`. If the OP wants `bar(buz`, he must use yours; if he wants `buz`, he must use mine. Anyway, I added your suggestion. – Avinash Raj Apr 22 '15 at 13:30
  • Good point. But I suspect he wouldn't want `buz` either, in that case. How about this: `str <- 'string1(bar(buz)foo)string2(int2)string3(int3)'; regmatches(str,regexpr('(?<=\\().*?(\\(.*?\\).*?)*(?=\\))',str,perl=T));`. – bgoldst Apr 22 '15 at 13:58
  • @Avinash - Can you please explain your solution in a simpler way, regmatches(string, regexpr("(?<=\\()[^()]*(?=\\))", string, perl=TRUE)) [1] "int1" – Sadhun Apr 24 '15 at 14:21
  • `(?<=\\()` is a lookbehind that matches the position immediately after a `(` character. `.*?` matches as few characters as possible, up to the point where the lookahead `(?=\\))` finds the first closing bracket `)`. So the pattern matches the characters between the parentheses. `regexpr` returns only a single match, i.e. the first one; `gregexpr` does a global match. Here, this returns the characters inside the first pair of parentheses. – Avinash Raj Apr 24 '15 at 15:18
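
The points raised in the comments above can be checked with a short sketch. It contrasts `regexpr` (first match only) with `gregexpr` (all matches), and shows how the two patterns differ on the unbalanced input `foo(bar(buz)`. The helper `first_match` and the variable names are hypothetical and used only for illustration.

```r
string <- "string1(int1)string2(int2)string3(int3)"
negated_pat <- "(?<=\\()[^()]*(?=\\))"  # cannot cross "(" or ")"
lazy_pat    <- "(?<=\\().*?(?=\\))"     # shortest run up to a ")"

# Hypothetical helper: the first match of `pattern` in `x`.
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# regexpr stops after the first match; gregexpr returns every match
# as a list with one element per input string.
first_only  <- first_match(negated_pat, string)
all_matches <- regmatches(string, gregexpr(negated_pat, string, perl = TRUE))[[1]]

# With an unbalanced "(" the two patterns disagree.
unbalanced     <- "foo(bar(buz)"
negated_result <- first_match(negated_pat, unbalanced)
lazy_result    <- first_match(lazy_pat, unbalanced)

print(first_only)      # "int1"
print(all_matches)     # "int1" "int2" "int3"
print(negated_result)  # "buz"
print(lazy_result)     # "bar(buz"
```

Which of the two results is preferable for unbalanced input depends on what the data is supposed to mean, as the comments point out.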