Let's say I have the following data.table in R:

 test
   a b node_feature_name node_split index node_child_left
1: 0 9                 b        7.5    99              11
2: 1 8                 a        1.5    99              12
3: 2 7                 b        7.5    99              13
4: 4 6                 a        1.5    99              14

What I would like is an update of the column index with the value of node_child_left IF the value of the column named by node_feature_name is smaller than or equal to (<=) node_split (dynamic per row).

Thus I would like to have something like:

1: b -> 9 <= 7.5 == False
2: a -> 1 <= 1.5 == True
3: b -> 7 <= 7.5 == True
4: a -> 4 <= 1.5 == False

But

dt[, index := ifelse(get(node_feature_name) <= node_split, node_child_left, index), by = node_split]

is too slow. :'( Also, when I leave out the by statement:

test[,index := ifelse( get(node_feature_name) <= node_split, node_child_left, index)]

then I don't get the desired result, because get(node_feature_name) will refer to the column named by the first value of node_feature_name (thus column b) for every row.
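For illustration, here is a minimal base R sketch of that failure mode (reconstructing the example table; the `wrong` name is mine): using only the first feature name means column b is compared on every row.

```r
test <- data.frame(
  a = c(0L, 1L, 2L, 4L), b = 9:6,
  node_feature_name = c("b", "a", "b", "a"),
  node_split = c(7.5, 1.5, 7.5, 1.5),
  index = rep(99L, 4), node_child_left = 11:14,
  stringsAsFactors = FALSE
)

# Mimic the ungrouped get(): only the FIRST feature name ("b") is used,
# so column b is compared against node_split for all four rows.
wrong <- test[[test$node_feature_name[1]]] <= test$node_split
wrong  # FALSE FALSE  TRUE FALSE -- rows 2 and 4 should have used column a
```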

desired result:

   a b node_feature_name node_split index node_child_left
1: 0 9                 b        7.5    99              11
2: 1 8                 a        1.5    12              12
3: 2 7                 b        7.5    13              13
4: 4 6                 a        1.5    99              14

2 Answers

test[, nfeat := .SD[[.BY[[1]]]], by = node_feature_name]
test[nfeat <= node_split, index := node_child_left]
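To unpack the trick: grouping by node_feature_name makes `.BY[[1]]` the (single) feature name of each group, so `.SD[[.BY[[1]]]]` pulls that column's values for the group's rows. A self-contained run on the example data (reconstructed from the question):

```r
library(data.table)

test <- data.table(
  a = c(0L, 1L, 2L, 4L), b = 9:6,
  node_feature_name = c("b", "a", "b", "a"),
  node_split = c(7.5, 1.5, 7.5, 1.5),
  index = rep(99L, 4), node_child_left = 11:14
)

# Per group, .BY[[1]] is "a" or "b"; .SD[[.BY[[1]]]] is that column,
# restricted to the group's rows.
test[, nfeat := .SD[[.BY[[1]]]], by = node_feature_name]

# A plain vectorised subset-assign then does the conditional update.
test[nfeat <= node_split, index := node_child_left]

test$index  # 99 12 13 99
```

This avoids both the per-row `by` and the misbehaving ungrouped `get()`.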

Other ways of creating nfeat:

Data Table - Select Value of Column by Name From Another Column

How to select the columns by the content in another column in data.table of R?

Select values from different columns based on a variable containing column names
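One of those alternatives, sketched here with melt plus an update join (the column names `row` and `nfeat` are my own, and I assume only features a and b):

```r
library(data.table)

test <- data.table(
  a = c(0L, 1L, 2L, 4L), b = 9:6,
  node_feature_name = c("b", "a", "b", "a"),
  node_split = c(7.5, 1.5, 7.5, 1.5),
  index = rep(99L, 4), node_child_left = 11:14
)

# Long form: one row per (record, feature) pair.
long <- melt(test[, .(row = .I, a, b)], id.vars = "row",
             variable.name = "node_feature_name", value.name = "nfeat",
             variable.factor = FALSE)

# Update join: pick each record's own feature value back into test.
test[, row := .I]
test[long, nfeat := i.nfeat, on = .(row, node_feature_name)]

test[nfeat <= node_split, index := node_child_left]
test$index  # 99 12 13 99
```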

I guess this is not a good way to structure your data (with column names referred to in another column), but without knowing more, I can't really say how best to improve it.

Re speed, we don't know whether the full data has a ton more columns like a and b or a ton more rows, so I am not going to make my own example data to test with.

  • I was thinking about some option, the `.BY` looks promising. Have to check the timings – akrun Jan 24 '19 at 17:17
  • @Frank, at this moment I'm working with a decision tree (one of my random forest models). What I actually have to do is traverse the whole tree. (My training sample contains 700,000 records and the deepest node (leaf) is around 80 nodes.) Therefore I had the following setup in mind: I write a loop whereby I go (from the root node) to the next node in each iteration, so each record has to know its index (current node), current feature, split value, and left & right child. (Actually it should work quite well, except for this bottleneck.) – Dieter Jan 24 '19 at 18:13
  • @Dieter Hm, interesting problem. If I understand correctly, I guess I would use a few tables -- one for the parent-child relation; another for node attributes (split value, feature name, is it terminal); another in long form for the record attributes (record, feature name, feature value, having 700k x # features rows); and another like yours with one row per record, iteratively updating non-terminated records by joining with the other tables... It might be too complicated a problem to fit on SO, but you could try asking if you don't find an efficient solution. – Frank Jan 24 '19 at 18:36
If we change the `by` from 'node_split' to a sequence of rows, the `get` would return the correct column values, as 'node_split' has duplicate elements:

library(data.table)
setDT(test)[, index := if(get(node_feature_name) <= node_split) node_child_left 
                else index , by = seq_len(nrow(test))]
test
#   a b node_feature_name node_split index node_child_left
#1: 0 9                 b        7.5    99              11
#2: 1 8                 a        1.5    12              12
#3: 2 7                 b        7.5    13              13
#4: 4 6                 a        1.5    99              14

NOTE: It is better to group by a row sequence rather than by some other column, as there is always the possibility of a group having more than one row.


If it is base R, then we can use row/column indexing to make it faster:

setDF(test)
i1 <- test[1:2][cbind(seq_len(nrow(test)), 
          match(test$node_feature_name, names(test)))] <= test$node_split
test$index[i1] <- test$node_child_left[i1]
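The `cbind(row, col)` matrix index picks one cell per row. Wrapped as a small reusable helper (`pick_col` is a name I made up), a self-contained run on the example data:

```r
# Hypothetical helper: for each row i, return the value of column cols[i].
pick_col <- function(df, cols) {
  feats <- unique(cols)
  # Subset to just the feature columns first, so as.matrix() does not
  # coerce everything to character because of node_feature_name.
  m <- as.matrix(df[feats])
  m[cbind(seq_len(nrow(df)), match(cols, feats))]
}

test <- data.frame(
  a = c(0L, 1L, 2L, 4L), b = 9:6,
  node_feature_name = c("b", "a", "b", "a"),
  node_split = c(7.5, 1.5, 7.5, 1.5),
  index = rep(99L, 4), node_child_left = 11:14,
  stringsAsFactors = FALSE
)

i1 <- pick_col(test, test$node_feature_name) <= test$node_split
test$index[i1] <- test$node_child_left[i1]
test$index  # 99 12 13 99
```

Subsetting to the feature columns before indexing matters for the same reason the answer above uses `test[1:2]`: matrix-indexing a data frame goes through `as.matrix()`, which would coerce everything to character if the character column node_feature_name were still present.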

Benchmarks

On a slightly bigger dataset, the timings for the base R method are:

test1 <- test[rep(seq_len(nrow(test)), 1e6), ]

 system.time({
   i1 <- test1[1:2][cbind(seq_len(nrow(test1)), 
             match(test1$node_feature_name, names(test1)))] <= test1$node_split
   test1$index[i1] <- test1$node_child_left[i1]
 })
 # user  system elapsed 
 #  0.116   0.020   0.136 

and for the `get` approach with data.table:

system.time({
 setDT(test1)[, index := if(get(node_feature_name) <= node_split) node_child_left 
                 else index, by = seq_len(nrow(test1))]
})
# user  system elapsed 
#  8.678   0.187   7.792  

data

test <- structure(list(a = c(0L, 1L, 2L, 4L), b = 9:6, node_feature_name = c("b", 
"a", "b", "a"), node_split = c(7.5, 1.5, 7.5, 1.5), index = c(99L, 
 99L, 99L, 99L), node_child_left = 11:14), class = "data.frame", row.names = c(NA, 
 -4L))
  • OP is complaining about by some column being slow, and you think by each row is a good solution?! – eddi Jan 24 '19 at 15:52
  • @akrun - (i didn't downvote) but it is as slow as the other one (+ i already tried it before) – Dieter Jan 24 '19 at 15:52
  • @Dieter In that case, use the row/column indexing from `base R`. That would be much faster. I was just showing why you didn't get the correct answer in one of the tries – akrun Jan 24 '19 at 15:54
  • @akrun maybe that's worth a comment, but as a solution, this is no good – eddi Jan 24 '19 at 15:55