12

I have a data.table like this:

col1   col2   col3  new  
1       4     55    col1 
2       3     44    col2
3       34    35    col2
4       44    87    col3

I want to populate another column matched_value that contains the values from the respective column names given in the new column:

col1   col2   col3  new    matched_value
1       4     55    col1        1
2       3     44    col2        3
3       34    35    col2        34
4       44    87    col3        87 

E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.

How can I do this efficiently in R on a very large data.table?

Sam Firke
  • 21,571
  • 9
  • 87
  • 105
user3664020
  • 2,980
  • 6
  • 24
  • 45

2 Answers2

13

An excuse to use the obscure .BY:

DT[, newval := .SD[[.BY[[1]]]], by=new]

   col1 col2 col3  new newval
1:    1    4   55 col1      1
2:    2    3   44 col2      3
3:    3   34   35 col2     34
4:    4   44   87 col3     87

How it works. This splits the data into groups based on the strings in new. The value of the string for each group is stored in newname = .BY[[1]]. We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data.

Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.

Rorschach
  • 31,301
  • 5
  • 78
  • 129
Frank
  • 66,179
  • 8
  • 96
  • 180
  • lovely. very fast Looks like some other language. how does it work? waiting for the explanation! thanks! – user3664020 Oct 23 '15 at 20:16
  • 1
    +1 as this solution is faster than my `dt[, matched_value := as.integer(get(new)), by=1:nrow(dt)]` solution. – David Oct 23 '15 at 20:24
  • @user3664020 Finished adding an explanation. Let me know if it's not clear. I know the code looks bizarre. – Frank Oct 23 '15 at 20:26
  • 1
    @David I think `get` might be just as fast (maybe faster?) if you did `get(.BY[[1]]), by=new` – Frank Oct 23 '15 at 20:34
  • 3
    @Frank for curiosity's sake I've tested both `get(.BY[[1]])` and `.SD[[.BY[[1]]]]` using `system.time` on 10^6 rows. The two are indistinguishable in time. – David Oct 23 '15 at 21:03
  • @David Thanks for looking into it; that's interesting and useful to know. I've added a note to the answer. – Frank Oct 23 '15 at 21:11
  • 1
    @Frank Just a note that in 2020, `.SD[[.BY[[1]]]]` no longer seems to work with a message indicating that `.SD` cannot be found. Luckily, `get(.BY[[1]])` continues to work! – Martin Dec 01 '20 at 21:09
2

We can match the 'new' column with the column names of the dataset to get the column index, cbind with the row index (1:nrow(df1)) and extract the corresponding elements of the dataset based on row/column index. It can be assigned to a new column.

df1$matched_value <- df1[-4][cbind(1:nrow(df1),match(df1$new, colnames(df1) ))]
df1
#  col1 col2 col3  new matched_value
#1    1    4   55 col1             1
#2    2    3   44 col2             3
#3    3   34   35 col2            34
#4    4   44   87 col3            87

NOTE: If the OP have a data.table, one option is convert to data.frame or use with=FALSE while subsetting.

 setDF(df1) #to convert to 'data.frame'.

Benchmarks

set.seed(45)
df2 <- data.frame(col1= sample(1:9, 20e6, replace=TRUE),
col2= sample(1:20, 20e6, replace=TRUE), 
col3= sample(1:40, 20e6, replace=TRUE),
col4=sample(1:30, 20e6, replace=TRUE),
new= sample(paste0('col', 1:4), 20e6, replace=TRUE), stringsAsFactors=FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
#   user  system elapsed 
#  2.54    0.37    2.92 
akrun
  • 874,273
  • 37
  • 540
  • 662