
In R, I have a variable Author, with the value "(Bernoulli)Cuatrec."

I want to have only the names, so I'm using the following regex:

L <- strsplit(Author,"[()]")

but that's giving me 3 strings as result:

""          "Bernoulli" "Cuatrec."

How can I get only the two names, without the empty string?

PS: My actual regex is more complicated, it's simplified here.

Rodrigo

3 Answers


In the solutions below, set rmChars and splitChars (first solution) or chars (second solution) to patterns matching the actual sets of characters you need. Depending on your words and non-words, you might be able to use built-in classes such as chars <- "\\W", which matches any non-word character.

1) Remove the "(" first and then split on ")". Assuming s is the input string:

s <- "(Bernoulli)Cuatrec."
rmChars <- "[(]"
splitChars <- "[)]"
strsplit(gsub(rmChars, "", s), splitChars)[[1]]

giving:

[1] "Bernoulli" "Cuatrec." 

2) Another possibility is to replace each character in chars with a space, trim the ends and then split on space.

chars <- "[()]"
strsplit(trimws(gsub(chars, " ", s)), " ")[[1]]

giving:

[1] "Bernoulli" "Cuatrec." 
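
As a rough sketch of how the choice of chars changes the result in the second solution, the snippet below wraps it in a helper. The name split_on is made up for illustration, and splitting on " +" instead of a single space is an assumption added here so that adjacent delimiters do not produce empty strings.

```r
# Hypothetical helper (not from the answer above): replace every match of
# `chars` with a space, trim the ends, then split on runs of spaces.
split_on <- function(x, chars) {
  strsplit(trimws(gsub(chars, " ", x)), " +")[[1]]
}

s <- "(Bernoulli)Cuatrec."

# Only parentheses are treated as delimiters, so the period survives.
paren_only <- split_on(s, "[()]")

# "\\W" matches every non-word character, including the trailing period.
non_word <- split_on(s, "\\W")

# Adjacent delimiters give two spaces in a row; " +" collapses them.
adjacent <- split_on("(Bernoulli)(Cuatrec.)", "[()]")

print(paren_only)
print(non_word)
print(adjacent)
```

With "[()]" the abbreviation keeps its period ("Cuatrec."), while "\\W" drops it ("Cuatrec"), so the class is worth choosing with the real data in mind.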
G. Grothendieck

I tend to avoid installing new libraries whenever possible, so I can simply do:

L <- strsplit(Author,"[()]")[[1]]
L <- L[L != ""]  # drop the empty strings; which() is not needed here

I thought there would be a solution without the need for a library.
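
If Author holds several values rather than one, the same base R idea can be applied element by element. This is only a sketch: the function name split_nonempty and the second input value "Linnaeus" are invented for illustration.

```r
# Hypothetical helper: split each element of x on `pattern` and drop
# the empty pieces, using base R only.
split_nonempty <- function(x, pattern) {
  parts <- strsplit(x, pattern)
  # nzchar() is TRUE for strings with at least one character
  lapply(parts, function(p) p[nzchar(p)])
}

# The second value is an assumed example with no delimiters at all.
authors <- c("(Bernoulli)Cuatrec.", "Linnaeus")
res <- split_nonempty(authors, "[()]")
print(res)
```

The result is a list with one character vector per input string, so strings with different numbers of names are handled without padding.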

Rodrigo
`stringr` is based on `stringi` and is C-backed. It's super-fast and deals with character sets better. "Minimal" is _not_ always better. – hrbrmstr Jun 22 '15 at 14:09

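If your data always follow the same pattern, you can just drop the first element:

strsplit(Author,"[[:punct:]]")[[1]][-1]

giving:

[1] "Bernoulli" "Cuatrec"

Note that [[:punct:]] also matches the trailing period, so the result is "Cuatrec" rather than "Cuatrec.". Of course, if the pattern is irregular, my solution is useless.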
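
To illustrate the caveat about irregular patterns, here is a small sketch with an assumed input that has no leading delimiter (the string "Bernoulli(Cuatrec)" is made up for this example). Dropping the first element then discards a real name, whereas filtering by content does not.

```r
# Assumed irregular input: no "(" at the start of the string.
irregular <- "Bernoulli(Cuatrec)"

parts <- strsplit(irregular, "[[:punct:]]")[[1]]

# Positional approach: drops the first element, which here is a real name.
by_position <- parts[-1]

# Content-based approach: keeps every non-empty piece.
by_content <- parts[nzchar(parts)]

print(by_position)
print(by_content)
```

Filtering with nzchar() does not depend on where the delimiters fall, which makes it the safer choice when the format of the input is not guaranteed.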

SabDeM