
Related to "How to use data.table within functions and loops?": is there a better way to write the functions shown below, specifically using data.table?
Note: All the code below works, but it is slow.
(I used simple "cleaning" steps just to demonstrate the problem.)

The objective is to write a function that 1) efficiently 2) replaces 3) some values in a data.table, so that it can then be used in a loop to clean a large number of data sets.
In C++, this would be done using pointers and call by reference, as below:

   void cleanDT(dataTable* dt); cleanDT(&dt222);

In R, however, we copy entire data sets (data.tables) back and forth every time we call a function.

cleanDT <- function (dt) {
  strNames <- names(dt);   nCols <- 1:length(strNames)
  for (i in nCols) {
    strCol <- strNames[i]
    if ( class(dt[[strCol]]) == "numeric"  ) 
      dt[[strCol]] <- floor(dt[[strCol]])
    else 
      dt[[strCol]] <- gsub("I", "i", dt[[strCol]])
  }
  return(dt)
}
cleanDTByReference <- function (dt) {
  dtCleaned <- dt
  strNames <- names(dt);   nCols <- 1:length(strNames)
  for (i in nCols) {
    strCol = strNames[i]
    if ( class(dt[[strCol]]) == "numeric"  ) 
      dtCleaned[[strCol]] <- floor(dt[[strCol]])
    else 
      dtCleaned[[strCol]] <-  gsub("I", "i", dt[[strCol]]) 
  }
  eval.parent(substitute(dt <- dtCleaned))
}

library(data.table)
dt222 <- data.table(ggplot2::diamonds); dt222[1:2]
dt222 <- cleanDT(dt222); dt222[1:2]

dt222 <- data.table(ggplot2::diamonds); dt222[1:2]
#   carat     cut color clarity depth table price    x    y    z
#1:  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#2:  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31

cleanDTByReference(dt222); dt222[1:2]
#   carat     cut color clarity depth table price x y z
#1:     0   ideal     E     Si2    61    55   326 3 3 2
#2:     0 Premium     E     Si1    59    61   326 3 3 2

Then we would use this function to clean a list of data.tables in a loop like this one:

dt333 <- data.table(datasets::mtcars)
listDt <- list(dt222, dt333)

for(dt in listDt) {
  print(dt[1:2])
  cleanDTByReference(dt); print(dt[1:2])
}

Ideally, I would like to have all my data.tables "cleaned" this way, using a function. But at the moment, without the use of references, the code above DOES NOT actually change listDt, dt222, or dt333.
Can you advise how to achieve that?

IVIM
  • If you read the vignettes for the data.table package, they'll introduce a couple ways to modify by reference. Your function is not modifying by reference in the sense meant with respect to data.tables. – Frank Mar 11 '17 at 00:11

2 Answers


You can modify a data.table by reference inside a function if you follow data.table syntax.
(I strongly suggest studying the data.table vignettes and FAQ.)
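
As a minimal sketch of the difference (the two helper functions and the `toy` table are hypothetical names used only for illustration): a base-R style assignment inside a function works on a local copy, while data.table's `set()` updates the caller's table.

```r
library(data.table)

# Hypothetical helper: base-R assignment, which only changes a local copy.
floor_by_copy <- function(dt) {
  dt[["x"]] <- floor(dt[["x"]])
  invisible(NULL)
}

# Hypothetical helper: set() assigns by reference, so the caller's table changes.
floor_by_reference <- function(dt) {
  set(dt, j = "x", value = floor(dt[["x"]]))
  invisible(NULL)
}

toy <- data.table(x = c(1.5, 2.7))

floor_by_copy(toy)
unchanged <- identical(toy$x, c(1.5, 2.7))  # caller's table was not touched

floor_by_reference(toy)
changed <- identical(toy$x, c(1, 2))        # caller's table was updated in place
```

The same effect can be had with `:=` inside `[`, which is what the function below uses.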

Function definition:

change_DT_in_place <- function(DT){
  cat(address(DT), "\n")
  numcols <- DT[, which(sapply(.SD, is.numeric))]
  cat("num: ", numcols, "- ")
  if (length(numcols) > 0) {
    DT[, (numcols) := lapply(.SD, floor), .SDcols = numcols]
  }
  othcols <-  DT[, which(!sapply(.SD, is.numeric))]
  cat("other: ", othcols, "\n")
  if (length(othcols) > 0) {
    DT[, (othcols) := lapply(.SD, gsub, pattern = "I", replacement = "i"), 
       .SDcols = othcols]
  }
}

Note that I've added some cat() statements to demonstrate the inner workings. address() returns the address in RAM of a variable.
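
A small sketch of how `address()` can be used for this kind of check, assuming a hypothetical two-row table `toy`: the address should stay the same after an update with `:=` and differ for an explicit `copy()`.

```r
library(data.table)

toy <- data.table(x = c(1.2, 3.8))
before <- address(toy)

# Update by reference: the table object itself is not replaced.
toy[, x := floor(x)]
same_object <- identical(address(toy), before)

# copy() makes an explicit deep copy, which lives at a different address.
toy_copy <- copy(toy)
different_object <- !identical(address(toy_copy), before)
```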

To verify that the function works it has to be demonstrated that the data.tables

  1. have been changed
  2. but haven't been copied.

Create data.tables:

library(data.table)
dt1 <- as.data.table(ggplot2::diamonds)
dt2 <- as.data.table(mtcars)

Modify first data.table:

head(dt1)
#   carat       cut color clarity depth table price    x    y    z
#1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
address(dt1)
#[1] "0000000015660EE0"
change_DT_in_place(dt1)
#0000000015660EE0 
#num:  1 5 6 7 8 9 10 - other:  2 3 4 
address(dt1)
#[1] "0000000015660EE0"
head(dt1)
#   carat       cut color clarity depth table price x y z
#1:     0     ideal     E     Si2    61    55   326 3 3 2
#2:     0   Premium     E     Si1    59    61   326 3 3 2
#3:     0      Good     E     VS1    56    65   327 4 4 2
#4:     0   Premium     i     VS2    62    58   334 4 4 2
#5:     0      Good     J     Si2    63    58   335 4 4 2
#6:     0 Very Good     J    VVS2    62    57   336 3 3 2

Modify second data.table:

head(dt2)
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
address(dt2)
#[1] "0000000018C42E78"
change_DT_in_place(dt2)
#0000000018C42E78 
#num:  1 2 3 4 5 6 7 8 9 10 11 - other:   
address(dt2)
#[1] "0000000018C42E78"
head(dt2)
#   mpg cyl disp  hp drat wt qsec vs am gear carb
#1:  21   6  160 110    3  2   16  0  1    4    4
#2:  21   6  160 110    3  2   17  0  1    4    4
#3:  22   4  108  93    3  2   18  1  1    4    1
#4:  21   6  258 110    3  3   19  1  0    3    1
#5:  18   8  360 175    3  3   17  0  0    3    2
#6:  18   6  225 105    2  3   20  1  0    3    1

Conclusion

In both cases, the function has changed the data.tables in place, as can be seen from the unchanged memory addresses.
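
Because the update happens by reference, the same kind of function can be applied in a loop over a list of data.tables. A hedged sketch, where `floor_numeric()` is a hypothetical simplified stand-in for `change_DT_in_place()` and `toy1`/`toy2` are made-up tables:

```r
library(data.table)

# Hypothetical stand-in for change_DT_in_place(): floors numeric columns only.
floor_numeric <- function(DT) {
  numcols <- names(DT)[vapply(DT, is.numeric, logical(1))]
  if (length(numcols) > 0) {
    DT[, (numcols) := lapply(.SD, floor), .SDcols = numcols]
  }
  invisible(DT)
}

toy1 <- data.table(a = c(1.5, 2.5), g = c("u", "v"))
toy2 <- data.table(b = c(9.9, 0.1))

# list() does not copy the tables, so the loop variable refers to the originals.
for (dt in list(toy1, toy2)) {
  floor_numeric(dt)
}
```

After the loop, `toy1` and `toy2` themselves hold the floored values; nothing needs to be assigned back.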

Uwe
  • This is brilliant! Impressively, looping with `for (dt in list(dt1, dt2, dt3, dt4)) { change_DT_in_place(dt) }` now achieves the desired effect: it modifies all the original data.tables! Thanks! – IVIM Mar 13 '17 at 13:43
  • So this is a huge advantage of `data.table` compared to everything else, where one would probably have to use Reference Classes or some other environment access tricks to achieve the same result! – IVIM Mar 13 '17 at 13:52

Here is a better way using data.table:

dt <- as.data.table(ggplot2::diamonds)
dt1 <- as.data.table(mtcars)

changeDT <- function(dt){
  cols <- names(dt)
  dt[, c(cols) := lapply(.SD, function(x) ifelse(sapply(x, is.numeric), 
                                                 floor(x), 
                                                 gsub("I", "i", x))),
     .SDcols = cols]
}

list1 <- list(dt, dt1)

x <- lapply(list1, changeDT)
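
One possible way to avoid calling `floor()` on non-numeric data is to test the type once per column with `if` instead of element-wise with `ifelse()`. This is only a sketch under that assumption; `change_dt_by_column()` and `toy` are hypothetical names, not part of the answer above.

```r
library(data.table)

# Hypothetical variant of changeDT(): the type test runs once per column,
# so floor() is only ever applied to numeric columns.
change_dt_by_column <- function(dt) {
  cols <- names(dt)
  dt[, (cols) := lapply(.SD, function(x) {
    if (is.numeric(x)) floor(x) else gsub("I", "i", x)
  }), .SDcols = cols]
  invisible(dt)
}

toy <- data.table(num = c(1.9, 2.1), chr = c("SI1", "VS2"))
change_dt_by_column(toy)
```

Note that `gsub()` returns character vectors, so factor columns would come back as character in this sketch.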
tbradley
  • @IVIM I updated my answer to show how to write your function using data.table functions to modify the table. – tbradley Mar 11 '17 at 02:35
  • That's a significant improvement over the for loop in the OP. However, it doesn't change the data.tables in place, which can be verified by using `address()`. This is also reflected in computing times. On my system, the `diamonds` data set needs 3.8 sec elapsed time with `changeDT`, while `change_DT_in_place()` [here](http://stackoverflow.com/a/42733511/3817004) took less than 0.1 sec. – Uwe Mar 11 '17 at 09:47
  • @UweBlock I made some edits that will allow the new `data.table` to be saved in the same place as the original without making copies. It may still take a little more time than yours, though. Anyway, thanks for the heads up. – tbradley Mar 11 '17 at 18:15
  • @tbradley, I liked your idea of using `lapply` instead of `for()`. However, note that your function changeDT() generates this error: `Error in floor(x) : non-numeric argument to mathematical function`. – IVIM Mar 13 '17 at 14:08
  • I had gotten that error while working on this problem, but running the code I currently have posted didn't produce that error on my system. It may just be different releases of the required packages. Oh well, thanks for the upvote! – tbradley Mar 13 '17 at 14:26