1

I have the data frame

df <- data.frame(x = rnorm(8), y = runif(8),
                 longstring = c("foo_100_Case1", "foo_125_Case1", "bar_100_Case1", "bar_125_Case1",
                                "foo_100_Case2", "foo_125_Case2", "bar_100_Case2", "bar_125_Case2"),
                 stringsAsFactors = FALSE)

I need to split the last column into three columns, using "_" as the delimiter. I've been doing the following:

a <- matrix(unlist(strsplit(df$longstring, "_", fixed = TRUE)), 8, 3, byrow = TRUE)
df$type=a[,1]
df$point=a[,2]
df$case=a[,3]

But I wonder if there's an easier way: the combination of strsplit and unlist is particularly awkward, and it doesn't make the code very readable.

DeltaIV

2 Answers

7

Here are some more options to try:

My "splitstackshape" package is designed for this kind of stuff...

library(splitstackshape)
cSplit(df, "longstring", "_")
#              x         y longstring_1 longstring_2 longstring_3
# 1: -1.41524742 0.2123978          foo          100        Case1
# 2: -1.09240237 0.3899935          foo          125        Case1
# 3:  0.39675025 0.2162463          bar          100        Case1
# 4: -1.14996728 0.7608128          bar          125        Case1
# 5: -0.07657172 0.6878348          foo          100        Case2
# 6:  0.29549599 0.2216566          foo          125        Case2
# 7:  1.78622612 0.1496666          bar          100        Case2
# 8: -0.11749579 0.9255409          bar          125        Case2

The "data.table" package brings us the fast tstrsplit function...

library(data.table)
as.data.table(df)[
  , paste0("V", 1:3) := tstrsplit(longstring, "_")][
    , longstring := NULL][]

If you have the time and want to wait for read.table to do its work...

cbind(df[1:2], read.table(text = df$longstring, sep = "_"))

If you need something else that is fast...

library(iotools)
cbind(df[1:2], mstrsplit(df$longstring, sep = "_"))
A5C1D2H2I1M1N2O1R2T1
4

You can try,

cbind(df[-3], data.frame(do.call('rbind', strsplit(df$longstring,'_'))))

#           x         y  X1  X2    X3
#1 -0.5522704 0.9998266 foo 100 Case1
#2  1.1907351 0.8979460 foo 125 Case1
#3  0.6005691 0.4301610 bar 100 Case1
#4 -1.0698081 0.9626781 bar 125 Case1
#5 -0.8526932 0.9634738 foo 100 Case2
#6  0.0100209 0.2968137 foo 125 Case2
#7 -1.5051358 0.7012956 bar 100 Case2
#8  1.0892584 0.4655736 bar 125 Case2

do.call lets you call any R function with its arguments supplied as a list instead of one by one (see ?do.call for more). Here it passes each element of the list returned by strsplit to rbind, which produces one row per string. The result contains only the X1, X2 and X3 columns, so I use cbind to attach them to the original columns (df[-3] drops longstring). strsplit, as you already know, splits each string at _.
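
To make the role of do.call concrete, here is a minimal base R sketch on a two-element toy vector. The input strings and the names parts, m1, m2 and res are made up for illustration and are not part of the answer's code; the column names type, point and case are borrowed from the question.

```r
# Toy input (hypothetical): strsplit() returns a list with one character vector per string
parts <- strsplit(c("foo_100_Case1", "bar_125_Case2"), "_", fixed = TRUE)

# do.call() hands each list element to rbind() as a separate argument,
# so the following two calls build the same character matrix
m1 <- do.call(rbind, parts)
m2 <- rbind(parts[[1]], parts[[2]])
identical(m1, m2)

# Convert the matrix to a data frame and name the columns in one step
res <- setNames(as.data.frame(m1, stringsAsFactors = FALSE),
                c("type", "point", "case"))
res
```

The advantage of the do.call form is that it works for a list of any length, whereas the explicit rbind call has to spell out every element.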


Or, as @joran mentioned, you can use separate from the tidyr package:

library(tidyr)
separate(df, longstring, c("X1", "X2", "X3"), sep="_")

#           x         y  X1  X2    X3
#1 -0.5522704 0.9998266 foo 100 Case1
#2  1.1907351 0.8979460 foo 125 Case1
#3  0.6005691 0.4301610 bar 100 Case1
#4 -1.0698081 0.9626781 bar 125 Case1
#5 -0.8526932 0.9634738 foo 100 Case2
#6  0.0100209 0.2968137 foo 125 Case2
#7 -1.5051358 0.7012956 bar 100 Case2
#8  1.0892584 0.4655736 bar 125 Case2
Ronak Shah
  • Definitely more readable than mine, and I like that you drop `df$longstring`. Can you please explain your code, though? What's `do.call`? Why do you need both `cbind` and `rbind`? Why in your code do you need `as.character`, while in mine I can directly write `strsplit(df$longstring...)`? – DeltaIV Dec 17 '15 at 16:28
  • 1
    Ok, I like the `tidyr` syntax best. Usually I don't like to install packages to do stuff that can be done in base R, but I really like the fact that `tidyr` allows me to separate and rename columns in one go. And the syntax is much more readable! – DeltaIV Dec 17 '15 at 16:42
  • @DeltaIV you are right, `tidyr` is more readable and also there is no need for `as.character` in `strsplit`, hence updated – Ronak Shah Dec 17 '15 at 16:43