Select values from different columns based on a variable containing column names

Question

I have a data.table like this:

col1   col2   col3  new  
1       4     55    col1 
2       3     44    col2
3       34    35    col2
4       44    87    col3

I want to populate another column matched_value that contains the values from the respective column names given in the new column:

col1   col2   col3  new    matched_value
1       4     55    col1        1
2       3     44    col2        3
3       34    35    col2        34
4       44    87    col3        87

E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.

How can I do this efficiently in R on a very large data.table?

score 13 · Accepted Answer · edited Nov 06 '15 at 20:46

13

An excuse to use the obscure .BY:

DT[, newval := .SD[[.BY[[1]]]], by=new]

   col1 col2 col3  new newval
1:    1    4   55 col1      1
2:    2    3   44 col2      3
3:    3   34   35 col2     34
4:    4   44   87 col3     87

How it works. This splits the data into groups based on the strings in new. The value of the string for each group is stored in newname = .BY[[1]]. We use this string to select the corresponding column of .SD via .SD[[newname]]. .SD stands for Subset of Data.

Alternatives. get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.

edited Nov 06 '15 at 20:46

Rorschach

31,301
5
78
129

answered Oct 23 '15 at 20:11

Frank

66,179
8
96
180

lovely. very fast Looks like some other language. how does it work? waiting for the explanation! thanks! – user3664020 Oct 23 '15 at 20:16
1

+1 as this solution is faster than my `dt[, matched_value := as.integer(get(new)), by=1:nrow(dt)]` solution. – David Oct 23 '15 at 20:24
@user3664020 Finished adding an explanation. Let me know if it's not clear. I know the code looks bizarre. – Frank Oct 23 '15 at 20:26
1

@David I think `get` might be just as fast (maybe faster?) if you did `get(.BY[[1]]), by=new` – Frank Oct 23 '15 at 20:34
3

@Frank for curiosity's sake I've tested both `get(.BY[[1]])` and `.SD[[.BY[[1]]]]` using `system.time` on 10^6 rows. The two are indistinguishable in time. – David Oct 23 '15 at 21:03
@David Thanks for looking into it; that's interesting and useful to know. I've added a note to the answer. – Frank Oct 23 '15 at 21:11
1

@Frank Just a note that in 2020, `.SD[[.BY[[1]]]]` no longer seems to work with a message indicating that `.SD` cannot be found. Luckily, `get(.BY[[1]])` continues to work! – Martin Dec 01 '20 at 21:09

akrun · Answer 2 · 2015-10-23T20:16:43.297

2

We can match the 'new' column with the column names of the dataset to get the column index, cbind with the row index (1:nrow(df1)) and extract the corresponding elements of the dataset based on row/column index. It can be assigned to a new column.

df1$matched_value <- df1[-4][cbind(1:nrow(df1),match(df1$new, colnames(df1) ))]
df1
#  col1 col2 col3  new matched_value
#1    1    4   55 col1             1
#2    2    3   44 col2             3
#3    3   34   35 col2            34
#4    4   44   87 col3            87

NOTE: If the OP have a data.table, one option is convert to data.frame or use with=FALSE while subsetting.

 setDF(df1) #to convert to 'data.frame'.

Benchmarks

set.seed(45)
df2 <- data.frame(col1= sample(1:9, 20e6, replace=TRUE),
col2= sample(1:20, 20e6, replace=TRUE), 
col3= sample(1:40, 20e6, replace=TRUE),
col4=sample(1:30, 20e6, replace=TRUE),
new= sample(paste0('col', 1:4), 20e6, replace=TRUE), stringsAsFactors=FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
#   user  system elapsed 
#  2.54    0.37    2.92

edited Oct 23 '15 at 20:16

answered Oct 23 '15 at 19:40

akrun

874,273
37
540
662

Can you please a give an equivalent data table syntax? Also, would it be fast? I have a very big dataset having around 20 million rows. – user3664020 Oct 23 '15 at 19:45
@user3664020 The row/column indexing is fast, but `cbind`ing could be a bottleneck. How many columns you have? – akrun Oct 23 '15 at 19:46
5 columns. And it is data table. – user3664020 Oct 23 '15 at 19:46
Maybe using `fastmatch` ? – Steven Beaupré Oct 23 '15 at 19:47
@akrun could you please give an equivalent data table syntax? I am getting an error while applying it on a data table. – user3664020 Oct 23 '15 at 19:48
@user3644020 It is because of the indexing that is `df1[-1]` may not work, use `df1[, -4, with=FALSE]` – akrun Oct 23 '15 at 19:50
@user3664020 Check if this is fast `df1[, matched_value := melt(df1, id.var='new')[new==variable]$value]` – akrun Oct 23 '15 at 19:52
@akrun it is fast, but it is not picking up correct values, not sure where the values are coming from in the `matched_value` column.... – user3664020 Oct 23 '15 at 19:59
@user3664020 In the example, it was giving the right output, didn't test it carefully. – akrun Oct 23 '15 at 20:04
@akrun melt output is getting sorted from col1 to col4, i,e why i am getting wrong values. How do I prevent sorting? – user3664020 Oct 23 '15 at 20:11
@user3664020 Based on the dataset you have, the base R method is pretty fast, done within 2 secs. – akrun Oct 23 '15 at 20:17
1

presumably `new` is a factor, so really all you need is `transform(dat, val=dat[cbind(1:nrow(dat), new)])` – Rorschach Oct 23 '15 at 21:56

Select values from different columns based on a variable containing column names

2 Answers2

Benchmarks

Linked

Related