1

in many cases, when I need to address a column with its name passed in a variable, I see two following options: either myDT[[myCol]], or myDT[,get(myCol)], for example:

# get() ####
cast_num_get <- function(inpDT, cols2cast){
  for (thisCol in cols2cast){
    inpDT[, (thisCol):=as.numeric(get(thisCol))]
  }
  return(inpDT);
}

# [[ #### 
cast_num_b <- function(inpDT, cols2cast){
  for (thisCol in cols2cast){
    inpDT[[thisCol]] <- inpDT[[thisCol]]
  }
  return(inpDT);
}


# two more options added from the comments: 

# lapply(.SD) ####
cast_num_apply <- function(inpDT, cols2cast){
  inpDT[, (cols2cast) := lapply(.SD, as.numeric), .SDcols = cols2cast]
  return(inpDT);
}

# set() ####
cast_num_for_set <- function(inpDT, cols2cast){
  for (thisCol in cols2cast){
    set(inpDT, j = thisCol, value = as.numeric(inpDT[[thisCol]]))
  }
  return(inpDT);
}
Vasily A
  • 8,256
  • 10
  • 42
  • 76
  • 1
    In this example `inpDT[, (cols2cast) := lapply(.SD, as.numeric), .Sdcols = cols2cast]` would imo be better than any of the two options you showed. – Jaap Mar 18 '18 at 09:29
  • 1
    Also, if you already resort to a for loop, use `set` instead of `\`[.data.table\``. Other than that `:=` assigns by reference, while `[[` is primitive and very ligh-weight (though possibly makes a copy- at least a shalow one). `get` is usually very in-efficient and there usually better alternatives when working within the data.table environment. – David Arenburg Mar 18 '18 at 09:35
  • You don't show any benchmarks but I suspect you are not benchmarking correctly. [[ should be faster than get within [ even if only because it's one function call vs two calls. – Roland Mar 18 '18 at 09:36
  • Related / possible duplicate: [*`Convert *some* column classes in data.table`*](https://stackoverflow.com/q/32940580/2204410) – Jaap Mar 18 '18 at 09:40
  • @Roland, you're right, I had a mistake there (sorry!). `[[` is _much_ faster. I will add correct benchmarking results. – Vasily A Mar 18 '18 at 09:42
  • @Jaap, if possible, could you tell why `lapply()` would be better? In this specific example, I agree, it allows to skip the loop - but what if I didn't have a loop (if I worked with only single column)? Thanks! P.S. maybe you could also make it as an answer, not a comment? – Vasily A Mar 18 '18 at 09:48
  • @DavidArenburg, thanks for pointing out `set`, I will add it as well! – Vasily A Mar 18 '18 at 09:48
  • @DavidArenburg, maybe you could also move your very helpful explanation to the answer? - it would be easier for other users to find it if someone else will read the question – Vasily A Mar 18 '18 at 09:52
  • 1
    Your `cast_num_b`-option has an unfair advantage: it doesn't do anything. Instead of `inpDT[[thisCol]] <- inpDT[[thisCol]]`, you should use `inpDT[[thisCol]] <- as.numeric(inpDT[[thisCol]])` inside that function. Added an answer as well with a comparison of the different approaches and some explanation. – Jaap Mar 18 '18 at 11:35
  • damn, sorry for this another error (I should stop coding with sleeping brain...) – Vasily A Mar 18 '18 at 17:24

1 Answers1

3

For this example I would use:

DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

Two alternatives (based on my answer here) with for:

# alternative 1 with 'set'
for (col in cols) set(DT, j = col, value = as.numeric(DT[[col]]))

# alternative 2 with ':='
for (col in cols) DT[, (col) := as.numeric(DT[[col]])]

Neither of these three approaches is necessarily better. They all have the same advantage: they will update the DT by reference.

Comparing the different approaches with a benchmark:

microbenchmark(vasily_get = {inpDT <- copy(DT); cast_num_get(inpDT, cols)},
               vasily_b = {inpDT <- copy(DT); inpDT <- cast_num_b(inpDT, cols)},
               jaap_lapply = {inpDT <- copy(DT); inpDT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]},
               jaap_for_set1 = {inpDT <- copy(DT); for (col in cols) set(inpDT, j = col, value = as.numeric(inpDT[[col]]))},
               jaap_for_set2 = {inpDT <- copy(DT); for (col in cols) inpDT[, (col) := as.numeric(inpDT[[col]])]},
               times = 100)

gives:

Unit: milliseconds
          expr      min       lq     mean   median       uq      max
    vasily_get 399.0723 414.2708 530.3024 429.5070 663.3513 1194.827
      vasily_b 388.7294 408.0004 528.4039 418.9236 664.5881 1441.941
   jaap_lapply 401.8001 424.1902 562.9259 453.5073 668.3900 1376.654
 jaap_for_set1 399.2213 433.9918 568.7211 628.4220 668.1248 1198.950
 jaap_for_set2 395.1966 405.5584 510.2038 421.3801 652.1263 1097.931

Neither of the approaches stands out with regard to speed. However, the cast_num_b aproach has one big disadvantage: to make the change permanent, you will have to assign the result of that function back to the input data.table.

When you run the following code:

inpDT <- copy(DT)
address(inpDT)
inpDT <- cast_num_b(inpDT, cols)
address(inpDT)

you get:

> inpDT <- copy(DT)
> address(inpDT)
[1] "0x145eb6a00"
> inpDT <- cast_num_b(inpDT, cols)
> address(inpDT)
[1] "0x12a632ce8"

As you can see, the location in the computer's memory has changed. It can therefore be considered the less efficient approach.


Used data:

DT <- data.table(lets = sample(LETTERS, 1e6, TRUE),
                 V1 = as.character(rnorm(1e6)),
                 V2 = as.character(rnorm(1e6)),
                 V3 = as.character(rnorm(1e6)),
                 V4 = as.character(rnorm(1e6)))

cols <- names(DT)[2:5]
Frank
  • 66,179
  • 8
  • 96
  • 180
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • I'd be curious to see how this varies with nrows & ncols. I tried `n = 1e1; nc = 1e6; DT = as.data.table(matrix(sample(c("bah", "1.0"), n*nc, TRUE), n, nc); system.time(DT[, lapply(.SD, as.numeric)])` but didn't have the patience to wait for it to finish. Btw, Jan started working on a benchmarking vignette https://github.com/Rdatatable/data.table/blob/f9195b5179ce57cc5191b1b68ba29892dbb91a12/vignettes/datatable-benchmarking.Rmd it mentions use of copy and times=100 that might be relevant. – Frank Mar 18 '18 at 18:41
  • 1
    @Frank Thx. I agree that using `times = 100` isn't a necessity, `times = 10` is just as good (and will lead to the same conclusion). Although I agree that using `copy` adds overhead, I used it to get a fair benchmark between the several methods. Anyway, will study that vignette, but its pretty dense imo atm. – Jaap Mar 18 '18 at 18:55