How to split an R data.frame column based on a regular expression condition

Question

I have a data.frame and I want to split one of its columns into two based on a regular expression. More specifically, the strings have a suffix in parentheses that needs to be extracted into a column of its own.

So e.g. I want to get from here:

dfInit <- data.frame(VAR = paste0(c(1:10),"(",c("A","B"),")"))

to here:

dfFinal <- data.frame(VAR1 = c(1:10), VAR2 = c("A","B"))

G. Grothendieck · Accepted Answer · 2014-10-15T13:39:49.647

1) gsubfn::read.pattern. read.pattern in the gsubfn package can do that. The matches to the parenthesized portions of the regular expression are treated as the fields:

library(gsubfn)
read.pattern(text = as.character(dfInit$VAR), pattern = "(.*)[(](.*)[)]$")

giving:

   V1 V2
1   1  A
2   2  B
3   3  A
4   4  B
5   5  A
6   6  B
7   7  A
8   8  B
9   9  A
10 10  B

2) sub. Another way is to use sub:

data.frame(V1=sub("\\(.*", "", dfInit$VAR), V2=sub(".*\\((.)\\)$", "\\1", dfInit$VAR))

giving the same result.

3) read.table. This solution does not use a regular expression:

read.table(text = as.character(dfInit$VAR), sep = "(", comment.char = ")")

giving the same result.
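
For readers without gsubfn, the idea in 1), treating each parenthesized group of the pattern as a field, can also be sketched in base R with utils::strcapture (available from R 3.4.0). This is not part of the answer above; the anchored pattern and the column names VAR1 and VAR2 in the prototype are choices made here to mirror dfFinal from the question.

```r
# Input from the question
dfInit <- data.frame(VAR = paste0(c(1:10), "(", c("A", "B"), ")"))

# Each capture group becomes one column. `proto` supplies the column
# names and types (an assumption chosen here to mirror dfFinal).
res <- strcapture(
  pattern = "^(.*)[(](.*)[)]$",
  x = as.character(dfInit$VAR),
  proto = data.frame(VAR1 = integer(), VAR2 = character(),
                     stringsAsFactors = FALSE)
)
str(res)
```

Because the prototype fixes the types, VAR1 comes back as integer and VAR2 as character without a separate conversion step.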

akrun · Answer 2 · 2014-10-15T14:48:35.167

You could also use extract from tidyr

library(tidyr)
extract(dfInit, VAR, c("VAR1", "VAR2"), "(\\d+).([[:alpha:]]+).", convert=TRUE) # edited and added `convert=TRUE` as per @aosmith's comments.



#    VAR1 VAR2
#1     1    A
#2     2    B
#3     3    A
#4     4    B
#5     5    A
#6     6    B
#7     7    A
#8     8    B
#9     9    A
#10   10    B

answered Oct 15 '14 at 13:54

Setting `convert` to `TRUE` in `extract` avoids the need for `mutate`, although `VAR2` is then converted to a factor. – aosmith Oct 15 '14 at 14:40
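
On the factor point in the comment above: as far as the current tidyr documentation describes it, `convert = TRUE` runs `type.convert()` with `as.is = TRUE` on the new columns, in which case letters stay character; a factor result corresponds to `as.is = FALSE`. A base-R sketch of that difference follows. The vectors nums and lets are hypothetical stand-ins for the extracted columns, not output of extract itself.

```r
# Hypothetical stand-ins for the two extracted character columns
nums <- c("1", "2", "10")
lets <- c("A", "B", "A")

# as.is = TRUE: digit strings become integer, letters stay character
num_int <- type.convert(nums, as.is = TRUE)
let_chr <- type.convert(lets, as.is = TRUE)

# as.is = FALSE: letters are turned into a factor
let_fac <- type.convert(lets, as.is = FALSE)

class(num_int)
class(let_chr)
class(let_fac)
```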

score 1 · Answer 3 · edited May 23 '17 at 11:45

See the related question "Split column at delimiter in data frame". Applied to this data:

dfFinal <- within(dfInit, VAR<-data.frame(do.call('rbind', strsplit(as.character(VAR), '[[:punct:]]'))))

> dfFinal
   VAR.X1 VAR.X2
1       1      A
2       2      B
3       3      A
4       4      B
5       5      A
6       6      B
7       7      A
8       8      B
9       9      A
10     10      B
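
Note that within() here stores a whole data frame inside the single column VAR, which is why the printed names are VAR.X1 and VAR.X2. If two ordinary columns are wanted, a sketch along the same lines could look like this. The names parts, dfFlat, VAR1 and VAR2 are chosen here to match the question and are not taken from the answer.

```r
dfInit <- data.frame(VAR = paste0(c(1:10), "(", c("A", "B"), ")"))

# Split on any punctuation character; "1(A)" becomes c("1", "A")
parts <- do.call(rbind, strsplit(as.character(dfInit$VAR), "[[:punct:]]"))

# Build two ordinary columns and convert the first one to integer
dfFlat <- data.frame(
  VAR1 = as.integer(parts[, 1]),
  VAR2 = parts[, 2],
  stringsAsFactors = FALSE
)
str(dfFlat)
```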

answered Oct 15 '14 at 13:52

rgunning

Rich Scriven · Answer 4 · 2014-10-16T15:43:22.403

You can also use cSplit from splitstackshape.

library(splitstackshape)
cSplit(dfInit, "VAR", "[()]", fixed=FALSE)
#    VAR_1 VAR_2
# 1:     1     A
# 2:     2     B
# 3:     3     A
# 4:     4     B
# 5:     5     A
# 6:     6     B
# 7:     7     A
# 8:     8     B
# 9:     9     A
#10:    10     B

answered Oct 15 '14 at 18:41

@akrun - Thanks a lot for the edit. I didn't think a regex `sep` was possible yet. – Rich Scriven Oct 16 '14 at 15:44
No problem. I had a similar case earlier, and Ananda Mahto suggested this. – akrun Oct 16 '14 at 15:56

score 1 · Answer 5 · answered Oct 15 '14 at 19:09

An approach with regmatches and gregexpr:

as.data.frame(do.call(rbind, regmatches(dfInit$VAR, gregexpr("\\w+", dfInit$VAR))))
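
The call above returns text columns with the default names V1 and V2. A sketch of the same approach with the columns named and typed as in dfFinal follows; the intermediate names x, tokens and out, the setNames call and the as.integer step are additions for illustration.

```r
dfInit <- data.frame(VAR = paste0(c(1:10), "(", c("A", "B"), ")"))
x <- as.character(dfInit$VAR)

# gregexpr locates every run of word characters; regmatches extracts them
tokens <- regmatches(x, gregexpr("\\w+", x))

# One row per input string, one column per token
out <- as.data.frame(do.call(rbind, tokens), stringsAsFactors = FALSE)
out <- setNames(out, c("VAR1", "VAR2"))
out$VAR1 <- as.integer(out$VAR1)
str(out)
```

This relies on every string yielding exactly two tokens; inputs with a different number of word runs would need extra handling before rbind.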

Sven Hohenstein
