1

I have a very large data frame but the gist of the variable that I am interested in looks like this:

A    B
1    DW
2    DI
3    RW
4    RI
5    DW
6    RI
7    RW
8    DI

I want to break up variable B such that I create a variable that would reflect only the first letter in B and a variable that reflects only the last letter in B.

To that end, my data frame would have 2 additional variables, resulting in something that looks like this

A    B    C  D
1    DW   D  W
2    DI   D  I
3    RW   R  W
4    RI   R  I
5    DW   D  W
6    RI   R  I
7    RW   R  W
8    DI   D  I

My data frame has over 100,000 lines. How can I get R to do this automatically?

Thank you for your help!

Jennifer
  • 285
  • 1
  • 3
  • 14

4 Answers4

4

We can do this with separate

library(tidyr)
library(dplyr)
df1 %>%
   separate(B, into = c('C', 'D'), sep= 1, remove = FALSE)
#  A  B C D
#1 1 DW D W
#2 2 DI D I
#3 3 RW R W
#4 4 RI R I
#5 5 DW D W
#6 6 RI R I
#7 7 RW R W
#8 8 DI D I

data

df1 <- structure(list(A = 1:8, B = c("DW", "DI", "RW", "RI", "DW", "RI", 
"RW", "DI")), class = "data.frame", row.names = c(NA, -8L))
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Here is a base R solution

df <- cbind(df,`names<-`(data.frame(do.call(rbind,strsplit(df$B,""))),c("C","D")))

such that

> df
  A  B C D
1 1 DW D W
2 2 DI D I
3 3 RW R W
4 4 RI R I
5 5 DW D W
6 6 RI R I
7 7 RW R W
8 8 DI D I
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
1

Here's a solution using substr in base R:

cbind(df, 'C' = substr(df$B, 1, 1), 'D' = substr(df$B, nchar(df$B), nchar(df$B)))

#  A  B C D
#1 1 DW D W
#2 2 DI D I
#3 3 RW R W
#4 4 RI R I
#5 5 DW D W
#6 6 RI R I
#7 7 RW R W
#8 8 DI D I
sm925
  • 2,648
  • 1
  • 16
  • 28
1

The OP notes the large size of the dataframe. So it seems worthwhile to benchmark the three approaches proposed:

library(microbenchmark)

df <-
  tibble(
    A = 1:100000,
    B = paste0(sample(LETTERS,100000,replace=T),sample(LETTERS,100000,replace=T))
  )

microbenchmark(
  tidyr = df %>%
    separate(B, into = c('C', 'D'), sep= 1, remove = FALSE),
  strsplit = cbind(df,`names<-`(data.frame(do.call(rbind,strsplit(df$B,""))),c("C","D"))),
  substr = cbind(df, 'C' = substr(df$B, 1, 1), 'D' = substr(df$B, nchar(df$B), nchar(df$B)))
)

The tidyr solution using separate() is notably faster:

Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval
    tidyr 10.9737 11.99655 13.59860 13.32865 13.98510  28.6455   100
 strsplit 39.4084 42.33310 47.20898 46.13145 51.55815  67.0940   100
   substr 42.3147 47.90830 54.42131 51.05375 55.79760 184.6317   100

Though the gain diminishes somewhat with longer strings. For instance, adding a third character and adapting the above:

Unit: milliseconds
     expr     min      lq     mean   median       uq      max neval
    tidyr 17.6609 19.7422 24.06847 21.75830 22.93855  54.1001   100
 strsplit 43.7746 58.0660 69.91389 64.69815 72.97280 199.4662   100
   substr 50.8109 56.5016 65.98295 59.53490 65.45865 154.3368   100
AHart
  • 448
  • 3
  • 10