
I have a 20000 x 5 data set. It is currently processed iteratively, and the data set is updated on every iteration.

Cells in the data.frame get updated on every iteration, and I'm looking for help in making this run faster. Since this is a small data.frame, I'm not sure whether data.table would help.

Here are the benchmarks for data.frame subassignment:

sessionInfo()
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
set.seed(1234)
test <- data.frame(A = rep(LETTERS, 800), B = rep(1:26, 800), C = runif(20800), D = runif(20800), E = rnorm(20800))
microbenchmark::microbenchmark(test[765,"C"] <- test[765,"C"] + 25)
Unit: microseconds
                                  expr     min       lq     mean   median       uq      max neval
 test[765, "C"] <- test[765, "C"] + 25 112.306 130.8485 979.4584 186.3025 197.7565 44556.15   100

Is there a faster way to do this update than the approach I have posted?

sak88
  • The fastest way is the `set` function in package data.table. Obviously, it's even faster to avoid doing this. – Roland Nov 02 '16 at 08:13
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269) . This will make it much easier for others to help you. – Jaap Nov 02 '16 at 08:18
  • `microbenchmark::microbenchmark(test[765, "C"] <- test[[765, "C"]] + 25)` is faster than the one in my post, but are there any alternatives? – sak88 Nov 02 '16 at 08:32
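A minimal sketch of the comparison from the last comment, assuming the `test` data.frame from the question is still in the workspace (the labels are only for readability; timings will vary by machine):

library(microbenchmark)

# `[[` returns just the single element on the right-hand side, avoiding most of
# the overhead of the `[.data.frame` method, so the lookup part is cheaper
microbenchmark(
  single_bracket = test[765, "C"] <- test[765, "C"] + 25,
  double_bracket = test[765, "C"] <- test[[765, "C"]] + 25
)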

2 Answers


Interestingly enough, if you're using a data.table it doesn't seem to be faster at first glance. Perhaps it becomes faster when the assignment is used inside a loop.

library(data.table)
library(microbenchmark)
dt <- data.table(test)

# Accessing the entry
dt[765, "C", with = FALSE] 

# Replacing the value with the new one
# Basic data.table syntax
dt[i = 765, C := C + 25]

# Replacing the value with the new one
# using set() from data.table
set(dt, i = 765L, j = "C", value = dt[765L,C] + 25)

microbenchmark(
      a = set(dt, i = 765L, j = "C", value = dt[765L,C] + 25)
    , b = dt[i = 765, C := C + 25]
    , c = test[765, "C"] <- test[765, "C"] + 25
    , times = 1000       
  )

The results from microbenchmark:

                                                   expr     min      lq     mean  median       uq      max neval
 a = set(dt, i = 765L, j = "C", value = dt[765L, C] + 25) 236.357 246.621 266.4188 250.847 260.2050  572.630  1000
 b = dt[i = 765, `:=`(C, C + 25)]                         333.556 345.329 375.8690 351.668 362.6860 1603.482  1000
 c = test[765, "C"] <- test[765, "C"] + 25                73.051  81.805 129.1665  84.220  87.6915 1749.281  1000
hannes101
  • Use `a =` instead of `a <-`. Interesting and surprising benchmarks. Apparently, `[<-.data.frame` has been improved and now doesn't copy the data.frame anymore? – Roland Nov 02 '16 at 09:01
  • It may be worth adding a benchmark that updates 20k rows (nearly the whole example df), as the results are really different. – Tensibai Nov 02 '16 at 09:57
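A sketch of the larger benchmark suggested in the previous comment, updating nearly the whole example data set in one call instead of a single cell (the row range and labels are chosen here for illustration; no timings are shown since they depend on the machine):

library(data.table)
library(microbenchmark)

rows <- 1:20000                    # nearly the whole 20800-row example
dt   <- as.data.table(test)        # `test` is the data.frame from the question

microbenchmark(
  set_dt    = set(dt, i = rows, j = "C", value = dt[rows, C] + 25),
  assign_dt = dt[rows, C := C + 25],
  assign_df = test[rows, "C"] <- test[rows, "C"] + 25,
  times = 100
)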

You can start with the manual of the ?set function. In its example you will find code you can use for benchmarking. I just re-ran it and got the following timings.

library(data.table)
m = matrix(1, nrow = 2e6L, ncol = 100L)
DF = as.data.frame(m)
DT = as.data.table(m)    

system.time(for (i in 1:1000) DF[i, 1] = i)
#   user  system elapsed 
#  3.048   1.512  24.854
system.time(for (i in 1:1000) DT[i, V1 := i])
#   user  system elapsed 
#  0.232   0.000   0.259 
system.time(for (i in 1:1000) set(DT, i, 1L, i))
#   user  system elapsed 
#  0.000   0.000   0.002

Ideally you should run your actual update scenario on your own data and at your own scale to properly measure which approach is the "fastest". Also be sure to check memory usage; `[<-` subassignment seems to use more memory than the data.table way, and if you end up swapping it will be much slower.
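One quick, hypothetical way to inspect that copying behaviour is base R's tracemem(), which prints a message every time the traced object is duplicated:

# minimal sketch: any duplication of DF triggered by the subassignment is reported
DF <- as.data.frame(matrix(1, nrow = 2e6L, ncol = 100L))
tracemem(DF)
DF[1, 1] <- 1        # copies made by `[<-.data.frame`, if any, are printed here
untracemem(DF)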

jangorecki
  • The `set` approach works fine if I'm updating values in a loop; however, the data.frame approach seems to perform better in cases where I'm updating only a few cells in a data set. I'm a newbie, so sorry if that's a dumb question. – sak88 Nov 02 '16 at 12:41
  • @sak88 It is fine, you provided a reproducible example, which for me is the most important thing after communicating the question :) Be aware that you can update _slices_ of a data.table by providing a vector (not a scalar) to the `i` argument; see the sketch below. – jangorecki Nov 02 '16 at 13:22
  • @sak88 the data.frame approach is faster for a tiny data set, and the difference is in microseconds. The bigger the data set gets, the more the performance will differ in favor of `set`. – David Arenburg Nov 02 '16 at 13:26
  • The above statement runs in a loop and is executed 20M times (a for loop inside a for loop), so even at only 0.2 ms per iteration it takes more than 1 hour to complete. That's why I'm interested in the fastest way to update a data set. I'm also wondering whether a bulk update might be a better solution. – sak88 Nov 02 '16 at 13:32
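A minimal sketch of the slice/bulk update mentioned in the comments above, using a toy table and an arbitrary set of row numbers (both are made up for illustration):

library(data.table)

DT   <- data.table(A = rep(LETTERS, 800), C = runif(20800))
rows <- c(765L, 766L, 1024L)       # hypothetical rows to update in one go

# one set() call updates every row in `rows` instead of looping over single cells
set(DT, i = rows, j = "C", value = DT[rows, C] + 25)

# the same bulk update written with := syntax
DT[rows, C := C + 25]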