
These three commands return the same result (a regression on a subset of observations). I'd like to know whether there are important differences in terms of what data.table actually does in the background.

suppressMessages(library("data.table"))
suppressMessages(library("biglm"))
N=1e7; K=100
set.seed(1)
DT <- data.table(
  id = 1:N,
  v1 =  sample(5, N, TRUE),                          # int in range [1,5]
  v2 =  sample(1e6, N, TRUE),                        # int in range [1,1e6]
  v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
DT[, condition := id>100]

# first command
coefficients(biglm(v3 ~ v2 + v1, DT[id>100, c("v1", "v2", "v3"), with = FALSE]))

# second command
DT[ id >100, coefficients(biglm(v3 ~ v2 + v1, .SD)), .SDcols = c("v1", "v2", "v3")]

# third command
DT[, coefficients(biglm(v3 ~ v2 + v1, .SD)), by = condition, .SDcols = c("v1", "v2", "v3")]

If I run each command in a fresh R session, the time taken appears to be the same. In a more general situation, are all these commands equivalent from a memory/speed point of view? Thanks!
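A rough way to measure this per command (a sketch using base R's system.time and gc; gc(reset = TRUE) zeroes the "max used" counters so the following gc() reports the peak reached during the call):

# run in a fresh session, once per command
gc(reset = TRUE)
system.time(
  coefficients(biglm(v3 ~ v2 + v1, DT[id > 100, c("v1", "v2", "v3"), with = FALSE]))
)
gc()  # the "max used" column shows the peak memory during the call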

Matthew

1 Answer


Too long for a comment....

First command

  • creates an explicit temporary copy of DT where id > 100 (see the sketch after this list)
  • does not compute coefficients for id <= 100
  • creates a length-N logical vector and uses a vector scan to subset
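Illustratively, the subset in the first command amounts to something like this (a sketch of the effect, not data.table's actual internals):

i <- DT$id > 100                                 # length-N logical vector
sub <- DT[i, c("v1", "v2", "v3"), with = FALSE]  # explicit copy of the matching rows
coefficients(biglm(v3 ~ v2 + v1, sub))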

Second command

  • creates an implicit copy of DT where id > 100 (via .SD; see the sketch after this list)
  • creates a length-N logical vector and uses a vector scan to subset
  • does not compute coefficients for id <= 100
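You can inspect the implicit .SD copy directly; this sketch returns its dimensions instead of fitting the model:

DT[id > 100, dim(.SD), .SDcols = c("v1", "v2", "v3")]
# number of rows with id > 100, and 3 (the .SDcols)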

Third command

  • creates implicit copies of DT for both id > 100 and id <= 100 (only using the memory of the larger at any one time)
  • creates two sets of coefficients, one per group (see the sketch after this list)
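To see the two sets side by side, a sketch wrapping the coefficients in as.list() so each group becomes one row:

DT[, as.list(coefficients(biglm(v3 ~ v2 + v1, .SD))),
   by = condition, .SDcols = c("v1", "v2", "v3")]
# two rows: condition == FALSE (id <= 100) and condition == TRUE (id > 100)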

From a speed point of view, I would expect the biglm processing to be the slow point (relative to any of the data.table work).
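A rough way to check, timing the data.table subset separately from the fit (a sketch; exact timings will vary):

system.time(sub <- DT[id > 100, c("v1", "v2", "v3"), with = FALSE])  # data.table work
system.time(coefficients(biglm(v3 ~ v2 + v1, sub)))                  # biglm work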

You might get a speed-up by setting the key on condition (though keying itself takes time). If you need to subset on this condition repeatedly, keying is likely to be worthwhile.

e.g.

setkey(DT, condition)

You can then use a binary search to extract rows by the condition:

DT[.(TRUE)]
DT[.(FALSE)]

These will be faster and more memory-efficient than the equivalent vector scan.
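For example (a sketch; run the vector scan before keying to compare fairly):

system.time(DT[condition == TRUE])  # vector scan: allocates a length-N logical vector
setkey(DT, condition)               # one-off cost
system.time(DT[.(TRUE)])            # binary search on the key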

It might also be worth processing the data in chunks and using biglm's update method to add each chunk to the model. To see where the time actually goes, look at profr or utils::Rprof (see How to efficiently use Rprof in R?) and the documentation.
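For instance, a sketch of chunked fitting via biglm's update method (the 1e6 chunk size is arbitrary), followed by a minimal Rprof wrapper:

# fit incrementally on chunks of the id > 100 subset
sub <- DT[id > 100, c("v1", "v2", "v3"), with = FALSE]
chunks <- split(seq_len(nrow(sub)), ceiling(seq_len(nrow(sub)) / 1e6))
fit <- biglm(v3 ~ v2 + v1, sub[chunks[[1]]])
for (ix in chunks[-1]) fit <- update(fit, sub[ix])
coefficients(fit)

# profile a command with utils::Rprof
Rprof("fit.out")
coefficients(biglm(v3 ~ v2 + v1, DT[id > 100, c("v1", "v2", "v3"), with = FALSE]))
Rprof(NULL)
summaryRprof("fit.out")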

mnel