Using split function in R on special character

Question

I have a R dataFrame that contains the following:

column1          column2
score1...test1     10
score2...test2     11
score3...test3     15

I would like to reshape my dataFrame into the following:

column1          column2    score     test
score1...test1     10       score1    test1
score1...test2     11       score2    test2
score1...test3     15       score3    test3

I tried to use

library(stringr)
temp=str_split_fixed(df, " ...", 4)

but I am getting this

 [,1] [,2] [,3] [,4]

How can tackle this problem?

You are using this function in correctly. You should split a column, not the whole data set. And the pattern should be just "..." rather " ...". Also, `n` should 2 instead of 4. Regardlles, this seems like a bug in the `stringr` package because `str_split_fixed(df$column1, "...", 2)` doesn't work properly, whereas `stringi::stri_split_fixed(df$column1, "...", 2, simplify = TRUE)` works. And this `str_split_fixed(df$column1, "\\.\\.\\.", 2)` works too, while it shouldn't as this should be `fixed`. The whole thing looks messed up. — David Arenburg, Mar 14 '16 at 12:00

lukeA · Answer 1 · 2016-03-14T12:21:21.473

4

You can do

library(splitstackshape)
cSplit(df, 1, "...", drop=F)
#           column1 column2 column1_1 column1_2
# 1: score1...test1      10    score1     test1
# 2: score2...test2      11    score2     test2
# 3: score3...test3      15    score3     test3

Or setnames(cSplit(df, 1, "...", drop=F), 3:4, c("score", "test"))[] it you need to give custom names.

edited Mar 14 '16 at 12:21

answered Mar 14 '16 at 11:49

lukeA

53,097
5
97
100

1

I'd suggest `setnames` instead of `setNames` as we are already dealing with a "data.table" as the output of `cSplit`. – A5C1D2H2I1M1N2O1R2T1 Mar 14 '16 at 12:00
@AnandaMahto Good idea, I changed that. – lukeA Mar 14 '16 at 12:21

akrun · Accepted Answer · 2016-03-14T12:11:33.610

2

We can do this with base R. Replace the ... with , using sub, read the strings with read.csv to create a data.frame with two columns, cbind with the original dataset to get the expected output.

  cbind(df,read.csv(text=sub('[[:punct:]]+', ',', 
      df$column1), header=FALSE, col.names=c('score', 'test')))
 #          column1 column2  score  test
 #1 score1...test1      10 score1 test1
 #2 score2...test2      11 score2 test2
 #3 score3...test3      15 score3 test3

Or we can directly use ... as the pattern in sub, replace with `,', and the rest is above.

 cbind(df,read.csv(text=sub('...', ',', 
      df$column1, fixed=TRUE), header=FALSE, 
         col.names=c('score', 'test')))

If we need a package solution, separate from tidyr can be used.

library(tidyr)
separate(df, column1, into=c("score", "test"), remove=FALSE)

edited Mar 14 '16 at 12:11

answered Mar 14 '16 at 11:46

akrun

874,273
37
540
662

what is the spliting criteria for the second method? can I specify the "..." as my splitting criteria? – user3841581 Mar 14 '16 at 12:08
@user3841581 In the first two cases, we replaced the `...` with `,` and by default, the `read.csv` use `sep=","`. In `separate`, it will automatically detect the sep as `...` – akrun Mar 14 '16 at 12:09
sorry for the question, I already have my dataFrame, why should I read it again? If I already have the dataFrame, how does the first and second method works? – user3841581 Mar 14 '16 at 12:13
@user3841581 I understand that you already have a data.frame. It is a way to split the dataset if you are using the first method in `base R`. In the `separate`, it is more straightforward. – akrun Mar 14 '16 at 12:14
thank you so much. The problem is that, I might have pattern like "k..score2...test2...results". your method seems to split everything. That is why I would have like to have a splitting condition. – user3841581 Mar 14 '16 at 12:30
@user3841581 In that case use `gsub` instead of `sub` as `sub` will only replace the first occurence and use `read.csv(text=gsub('...', ',', df$column1, fixed=TRUE), header=FALSE, fill=TRUE)` – akrun Mar 14 '16 at 12:31
1

Perfect. It works well:) – user3841581 Mar 14 '16 at 12:43

Using split function in R on special character

2 Answers2