strsplit returns empty string with regex

Question

In R, I have a variable Author, with the value "(Bernoulli)Cuatrec."

I want to have only the names, so I'm using the following regex:

L <- strsplit(Author,"[()]")

but that gives me three strings as a result:

""          "Bernoulli" "Cuatrec."

How can I get only the two names, without the empty string?

PS: My actual regex is more complicated, it's simplified here.

Try `library(stringr);str_extract_all(Author, '[^()]+')[[1]]` — akrun, Jun 22 '15 at 13:33
You could also try the `stringi` package `stri_split_regex(Author, "[()]", omit_empty = TRUE)` — David Arenburg, Jun 22 '15 at 13:34
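
For comparison, the "extract instead of split" idea from the first comment can also be sketched in base R, without any package. This is only an illustration using the simplified pattern from the question; `names_only` is a name chosen here, and the real separator set would need to be substituted.

```r
# Base-R sketch: pull out runs of non-separator characters instead of splitting.
Author <- "(Bernoulli)Cuatrec."

# gregexpr() locates every match of the pattern in the string;
# regmatches() returns the matched substrings as a list (one element per input).
names_only <- regmatches(Author, gregexpr("[^()]+", Author))[[1]]

print(names_only)
# [1] "Bernoulli" "Cuatrec."
```

Because only non-empty runs can match `[^()]+`, no empty string can appear in the result.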

G. Grothendieck · Answer 1 · 2015-06-22T15:41:54.827

3

In the solutions below, set rmChars and splitChars (for the first solution) or chars (for the second solution) to patterns matching the actual sets of characters you need. Depending on your words and non-words, you might be able to use a built-in class such as chars <- "\\W", which matches any non-word character.

1) Remove the ( first and then split on ). Assuming s is the input string:

rmChars <- "[(]"
splitChars <- "[)]"
strsplit(gsub(rmChars, "", s), splitChars)[[1]]

giving:

[1] "Bernoulli" "Cuatrec."

2) Another possibility is to replace each character in chars with a space, trim the ends and then split on space.

chars <- "[()]"
strsplit(trimws(gsub(chars, " ", s)), " ")[[1]]

giving:

[1] "Bernoulli" "Cuatrec."
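
As a rough sketch of how the second solution might extend to a larger separator set, the example below uses a hypothetical input string and an assumed character class (space, parentheses, `&`, `,`, `;`), neither of which comes from the question. Adding `+` to the class makes `gsub` replace each run of separators with a single space, so adjacent separators do not produce empty pieces. Note that `trimws` needs R 3.2.0 or later.

```r
# Hypothetical input with several kinds of separators (not from the question).
s <- "(Bernoulli) Cuatrec. & Smith"

# Assumed separator set; the trailing + collapses each run into one space.
chars <- "[ ()&,;]+"

# Replace separator runs with a space, trim both ends, split on a literal space.
parts <- strsplit(trimws(gsub(chars, " ", s)), " ", fixed = TRUE)[[1]]

print(parts)
# [1] "Bernoulli" "Cuatrec."  "Smith"
```

This only works when a space is itself one of the separators, since names containing spaces would otherwise be split apart.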

edited Jun 22 '15 at 15:41

answered Jun 22 '15 at 13:44

G. Grothendieck


Thanks, it works, but I have more separators than just (). My actual regex is strsplit(Author,"[ 众古在未()&,;]{1,}"). Maybe I should have put it all in the question? – Rodrigo Jun 22 '15 at 13:51
OK. I have modularized it. – G. Grothendieck Jun 22 '15 at 14:05
What's the earliest R version for `trimws`? First time seeing it. – David Arenburg Jun 22 '15 at 14:14

score 0 · Answer 2 · answered Jun 22 '15 at 13:41

I tend to avoid installing new libraries whenever possible, so I can simply do:

L <- strsplit(Author,"[()]")[[1]]
L <- L[which(L != "")]

I thought there would be a solution without the need for a library.
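
If the same filtering is needed in several places, it could be wrapped in a small helper. The sketch below is one possible shape, assuming only base R; `split_nonempty` is a hypothetical name, not an existing function.

```r
# split_nonempty() is a hypothetical helper name, not a base-R function.
split_nonempty <- function(x, pattern) {
  # strsplit() returns one character vector per input string;
  # nzchar() is TRUE for non-empty strings, so it filters out "".
  lapply(strsplit(x, pattern), function(parts) parts[nzchar(parts)])
}

Author <- "(Bernoulli)Cuatrec."
L <- split_nonempty(Author, "[()]")[[1]]

print(L)
# [1] "Bernoulli" "Cuatrec."
```

Like `strsplit`, the helper returns a list, so it also works on a whole vector of author strings.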

Rodrigo

`stringr` is based on `stringi` and is C-backed. It's super-fast and deals with character sets better. "Minimal" is _not_ always better. – hrbrmstr Jun 22 '15 at 14:09

score 0 · Answer 3 · answered Jun 22 '15 at 14:00

If your data always have the same pattern, you can just use this:

strsplit(Author,"[[:punct:]]")[[1]][-1]
[1] "Bernoulli" "Cuatrec"

Of course, if the pattern is irregular, this solution will not work. Note also that [[:punct:]] matches the trailing period, so "Cuatrec." becomes "Cuatrec".
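
To make the dependence on a fixed pattern concrete, here is a small sketch with a hypothetical input that does not begin with a separator. Dropping the first element with `[-1]` then removes a real name, whereas filtering out empty strings keeps both. The variable names are chosen here for illustration.

```r
# Hypothetical input that does not start with a separator.
Author2 <- "Bernoulli(Cuatrec)"

pieces <- strsplit(Author2, "[[:punct:]]")[[1]]
# pieces is c("Bernoulli", "Cuatrec"): there is no leading "" to discard.

dropped_first <- pieces[-1]            # loses "Bernoulli"
kept_nonempty <- pieces[pieces != ""]  # keeps both names

print(dropped_first)
# [1] "Cuatrec"
print(kept_nonempty)
# [1] "Bernoulli" "Cuatrec"
```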

SabDeM
