0

I am very new to data.table and would like to try it out to see if it makes my analysis faster. I mainly use knitr to compile .rnw files (which I tend to compile many times per hour so I want it to be as fast as possible).

I have posted a sample below and this is by no means a question of comparison agianst data.table and data.frame. I would like to know if I my code below is what it should be.

I am basically joining two data.tables and then need to linearly approximate using na.approx missing NA values. I used the Introduction to data.table vignette from CRAN and JOINing data in R using data.table from R-Pubs.

The code I am using below results in my best attempt at a data.table method taking a long time (in general too, I only added the other code reference).

Also, if anyone knows if there is a way to pipe in na.approx() into a chain and still have the output as a data.frame that would be appreciated. Note the df_merged = as.data.frame(df_merged) line that I would like to get rid of if possible!

Any input is greatly appreciated thank you!

library(data.table)
library(zoo)
library(dplyr)

dt_function_test = function() {
    set.seed(123)
    # data.table
    dt_random = data.table(vals = runif(1E5, 0, 500))
    dt_na = data.table(vals = c(0, 250, 500),
                       ref1 = c(0.33, 0.45, 0.78),
                       ref2 = c(0.12, 0.79, 1))

    dt_merged = merge(dt_random[],
                      dt_na[],
                      all = TRUE)

    dt_merged = dt_merged[, lapply(.SD,
                                   na.approx),
                          by = vals]
}


df_function_test = function() {
    set.seed(123)
    # data.frame
    df_random = data.frame(vals = runif(1E5, 0, 500))
    df_na = data.frame(vals = c(0, 250, 500),
                       ref1 = c(0.33, 0.45, 0.78),
                       ref2 = c(0.12, 0.79, 1))

    df_merged = full_join(df_random,
                          df_na) %>% 
        na.approx

    df_merged = as.data.frame(df_merged)
}

print(system.time(dt_function_test()))
#  user  system elapsed 
# 11.42    0.00   11.46 

print(system.time(df_function_test()))
# Joining, by = "vals"
#    user  system elapsed  
#    0.05    0.05    0.10 
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
Prevost
  • 677
  • 5
  • 20
  • if I am not wrong, they are calculating diff things. the `dplyr` code (someone pls correct me if I am wrong) performs na.approx on each of the column. the `data.table` code performs na.approx on each row (since vals is almost always diff on each row). just print a few rows of each output to check if they are the same. – chinsoon12 Sep 19 '18 at 03:20
  • 1
    You're grouping `by = vals` in the `data.table` code but not in the `data.frame` code? – MichaelChirico Sep 19 '18 at 03:31
  • @chinsoon12 Good point. The results are not in fact the same, but I intended them to be! So my understanding was flawed... I will revise, and also review your answer I see now... – Prevost Sep 19 '18 at 03:53
  • @MichaelChirico The `data.frame` code grouped by the column `vals` automatically because it was common in both, so I didn't specify which column to join by explicitly. – Prevost Sep 19 '18 at 04:15
  • That's not what `by` means? `by` says `apply `na.approx` within each value of `vals`... specifying merge columns should happen in the `merge` step? – MichaelChirico Sep 19 '18 at 04:18
  • I was using the `by` argument incorrectly in the `data.table::merge` line. I thought that specifying `by = vals` would indicate to perform the `na.approx` down each column without actually grouping any of the `data.table`. But that was wrong. In my actual data all the values in the `vals` columns are unique so in effect, the `vals` column contains all unique data and I thought that grouping `by = vals` would in effect not group anything. – Prevost Sep 19 '18 at 04:36

1 Answers1

3

Here is a few possible implementation using data.table that performs zoo::na.approx on the ref* columns (note that a larger dataset has been used also):

library(data.table)
library(zoo)

dt_function_test_0 = function() {
    set.seed(123)
    # data.table
    dt_random = data.table(vals = runif(1e7, 0, 500))
    dt_na = data.table(vals = c(0, 250, 500),
        ref1 = c(0.33, 0.45, 0.78),
        ref2 = c(0.12, 0.79, 1))

    cols <- c("ref1", "ref2")

    ##Version 0
    merge(dt_random, dt_na, all=TRUE)[, lapply(.SD, na.approx)]
}


dt_function_test_1 = function() {
    set.seed(123)
    # data.table
    dt_random = data.table(vals = runif(1e7, 0, 500))
    dt_na = data.table(vals = c(0, 250, 500),
        ref1 = c(0.33, 0.45, 0.78),
        ref2 = c(0.12, 0.79, 1))

    cols <- c("ref1", "ref2")

    ##Version 1: using update by reference
    merge(dt_random, dt_na, all = TRUE)[, 
        (cols) := lapply(.SD, na.approx), .SDcols=cols]
}


dt_function_test_2 = function() {
    set.seed(123)
    # data.table
    dt_random = data.table(vals = runif(1e7, 0, 500))
    dt_na = data.table(vals = c(0, 250, 500),
        ref1 = c(0.33, 0.45, 0.78),
        ref2 = c(0.12, 0.79, 1))

    cols <- c("ref1", "ref2")
    ##Version 2: using set
    dt_merged <- merge(dt_random, dt_na, all = TRUE)
    for (x in cols)
        set(dt_merged, j=x, value=na.approx(dt_merged[[x]]))
    dt_merged
}

timing output:

> system.time(dt_function_test_0())
   user  system elapsed 
   5.44    1.90    6.96 

> system.time(dt_function_test_1())
   user  system elapsed 
   3.55    1.30    4.41 

> system.time(dt_function_test_2())
   user  system elapsed 
   3.78    1.19    4.52
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • That works perfectly! Why must`(cols)` be in brackets? Why doesn't the whole `j` expression have to be in brackets? And am I correct in interpreting that `.SDcols=cols` indicates that the `j` computation should only be carried out on those specific columns? That way it avoids grouping of any kind? So any time I want to carry out a function on specific columns without grouping by any column, I would use the `(colnames):=function(), .SDcols=colnames` arguements? – Prevost Sep 19 '18 at 04:14
  • when you read the vignette on `:=`, you will understand why brackets is required. basically to differentiate from other kinds of update by reference. `.SDcols` can be found in `?data.table` which specifies the columns of `.SD` to be used. You can check out `?.SD` to understand what is .SD. `?data.table` is a must-read as well (tilting head down in shame, I might not have read every word in it). Yes, in *most cases*, you can use `DT[i, j=lapply(.SD, func), by=groupid, .SDcols=colsInDTtoApplyFunc]` – chinsoon12 Sep 19 '18 at 05:49
  • Thanks for the input. I will look into `data.table` more and do some reading! – Prevost Sep 19 '18 at 15:24