
I need to download a table from an MS SQL Server database.

The table has more than 6 million rows, and the server cannot return all of the data at once.

So I wrote code that downloads 10,000 rows at a time and binds the chunks together in a loop.

Assume that the getData() function returns a data frame containing 10,000 rows each time it is called (pseudocode):

wannagetValue <- NULL   #start with an empty accumulator so the first rbind() works

for(i in 1:600)
{
    tempValue <- getData()
    wannagetValue <- rbind(wannagetValue, tempValue)
    print(i)
}

The problem is that it gets slower as time goes by.

I suspect that using rbind this way is not a good idea.

Any advice will be very helpful. Thank you in advance.

Keith Park
  • If you know how big your eventual data frame is going to be then it will be a lot faster to pre-allocate it and fill in values as you go. Although there is likely a faster solution using different data structures that someone will post. – David Oct 31 '13 at 03:00
  • wannagetValue <- rbind(wannagetValue, getData()). That could speed up your code a little bit... but the main question is, why so much data? – Hackerman Oct 31 '13 at 03:02
  • Thank you David! You saved me. Robert // I have no idea; all I know is that there is over 500 GB of data. – Keith Park Oct 31 '13 at 03:34
  • You do realise that R stores everything in memory, right? If you have 500GB of data, you'll need at least that much RAM, and ideally twice as much. – Hong Ooi Oct 31 '13 at 05:35
  • @HongOoi // The entire raw data set will exceed 500 GB, but it will be 5~10 GB in total if stored as binary data. Furthermore, each table contains 1~2 GB, so I can deal with them separately. Fortunately, my computer has 64 GB of DDR3 ECC RAM; I think that's enough. By the way, thank you for the advice. – Keith Park Oct 31 '13 at 07:06
  • Not to be a wet blanket, but are you sure `R` is the tool you want to use, rather than a local SQL application? – Carl Witthoft Oct 31 '13 at 13:44
  • This would be much faster: l <- vector("list", 600); for (i in 1:600) { l[[i]] <- getData() }; wannagetValue <- do.call("rbind", l) – mdsumner Oct 31 '13 at 23:16
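
To make the suggestion from David and mdsumner in the comments above concrete, here is a minimal sketch (assuming, as in the question, that getData() returns the next 10,000-row chunk on each call): pre-allocate a list, store each chunk, and bind everything once at the end.

#pre-allocate a list with one slot per chunk
chunks <- vector("list", 600)
for (i in 1:600) {
  chunks[[i]] <- getData()   #storing a chunk does not copy the earlier ones
}
#bind all chunks in a single call at the end
wannagetValue <- do.call(rbind, chunks)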

4 Answers


Here are a few options, which I'm sure could still be improved upon:

library(data.table)
library(microbenchmark)

#function to generate your data
getData <- function(){
  data.frame(x=rnorm(10000),y=rnorm(10000),z=rnorm(10000))
}

#using data table's rbindlist each iteration
fDT1 <- function(n){
  dat <- getData()
  for(i in 1:n){
    dat <- rbindlist(list(dat,getData()))
  }
  return(data.frame(dat))
}

#using data table's rbindlist all at once
fDT2 <- function(n){
  return(data.frame(rbindlist(lapply(1:n,function(x) getData()))))
}

#pre-allocating a data frame
fPre <- function(n){
  dat <- data.frame(x=rep(0,n*10000),y=rep(0,n*10000),z=rep(0,n*10000))
  j <- 1
  for(i in 1:n){
    dat[j:(j+10000-1),] <- getData()
    j <- j + 10000
  }
  return(dat)
}

#standard do.call rbind
f2 <- function(n){
  return(do.call(rbind,lapply(1:n,function(x) getData())))
}

#current approach
f <- function(n){
  dat <- getData()
  for(i in 1:n){
    dat <- rbind(dat,getData())
  }
  return(dat)
}

As you can see, data.table's rbindlist() is a big improvement over base R's rbind(), and there is a big benefit to appending all the rows at once instead of over many iterations, although that may not be possible if there are memory concerns. You may also notice that the speed improvements are nowhere near linear as the size of the data increases.

 > microbenchmark(fDT2(5),fDT1(5),fPre(5),f2(5),f(5),
+                fDT2(25),fDT1(25),fPre(25),f2(25),f(25),
+                fDT2(75),fDT1(75),fPre(75),f2(75),f(75),
+                times=10)
Unit: milliseconds
     expr        min         lq     median         uq         max neval
  fDT2(5)   18.31207   18.63969   24.09943   25.45590    72.01725    10
  fDT1(5)   27.65459   29.25147   36.34158   77.79446    88.82556    10
  fPre(5)   34.96257   39.39723   41.24445   43.30319    68.75897    10
    f2(5)   30.85883   33.00292   36.29100   43.53619    93.15869    10
     f(5)   87.40869   97.97500  134.50600  138.65354   147.67676    10
 fDT2(25)   89.42274   99.39819  103.90944  146.44160   156.01653    10
 fDT1(25)  224.65745  229.78129  261.52388  280.85499   300.93488    10
 fPre(25)  371.12569  412.79876  431.80571  485.37727  1046.96923    10
   f2(25)  221.03669  252.08998  265.17357  271.82414   281.47096    10
    f(25) 1446.32145 1481.01998 1491.59203 1634.99936  1849.00590    10
 fDT2(75)  326.66743  334.15669  367.83848  467.85480   520.27142    10
 fDT1(75) 1749.83842 1882.27091 2066.95241 2278.55589  2419.07205    10
 fPre(75) 3701.16220 3968.64643 4162.70585 4234.39716  4356.09462    10
   f2(75) 1174.47546 1183.98860 1314.64585 1421.09483  1537.42903    10
    f(75) 9139.36935 9349.24412 9510.90888 9977.24621 10861.51206    10
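
Applied back to the original problem, the fastest approach above (fDT2: a single rbindlist() call over all the chunks) would look roughly like this, assuming getData() returns the next 10,000-row chunk each time it is called:

library(data.table)
#collect all 600 chunks first, then bind them with one rbindlist() call
wannagetValue <- data.frame(rbindlist(lapply(1:600, function(i) getData())))
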
David

As has been pointed out above, R stores all of its objects in RAM by default, so with that amount of data you are going to run into some problems.

Two things I'd like to add: 1) Generally, if you don't want to use data.table, you can use the rbind.fill function in Hadley's plyr package, which is also quite fast (see the sketch below). Never use rbind the way you did above, inside a for loop, appending each chunk separately: it forces R to copy the entire data frame object every time you append to it, and that is slow.

2) To work with larger-than-RAM data in R, have a look at the section "Large memory and out-of-memory data" at http://cran.r-project.org/web/views/HighPerformanceComputing.html; perhaps the bigmemory package is what you need.
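
For point 1, a minimal sketch of the rbind.fill route, again assuming getData() returns one 10,000-row chunk per call:

library(plyr)
#collect all chunks first, then bind them in one rbind.fill() call
chunks <- lapply(1:600, function(i) getData())
wannagetValue <- rbind.fill(chunks)   #rbind.fill also accepts a list of data frames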

coanil
  • This question is a little old, but I still found it while searching for a solution today, so I want to add that Hadley's newer `dplyr` package has the function `bind_rows`, which is analogous to `rbind.fill`. I benchmarked it and it runs about 1000x faster than `do.call('rbind', ...)` on my machine. See [this question](https://stackoverflow.com/questions/44464441/r-is-there-a-good-replacement-for-plyrrbind-fill-in-dplyr). – qdread Dec 11 '17 at 18:36

Another way, in pure base R using unlist(), might be fast.
Here is a comparison with data.table::rbindlist() and the other methods, based on the answer from @David.

#getData() returns the same pre-generated 10,000-row data frame on each call
getData <- (function(){
  set.seed(42)
  x <- data.frame(x=rnorm(10000), y=rnorm(10000), z=rnorm(10000))
  function() x
})()

#unevaluated expressions for bench::mark(); n is set before each run
M <- alist(
  rbind = do.call(rbind, lapply(1:n, function(x) getData())),
  rbind.fill = plyr::rbind.fill(lapply(1:n, function(x) getData())),
  rbindlist = data.frame(data.table::rbindlist(lapply(1:n, function(x) getData()))),
  bind_rows = dplyr::bind_rows(lapply(1:n, function(x) getData())),
  #column-wise: pull each column out of every chunk and flatten it with unlist
  unlist = {. <- lapply(1:n, function(x) getData())
    list2DF(lapply(setNames(seq_along(.[[1]]), names(.[[1]])), \(i)
            unlist(lapply(., `[[`, i), FALSE, FALSE)))}
)

Results.

n <- 5
bench::mark(exprs = M)
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 rbind      846.37µs   1.08ms      825.    4.36MB    154.    252    47
#2 rbind.fill   1.03ms   1.13ms      606.    3.25MB     56.9   245    23
#3 rbindlist  310.98µs 474.46µs     1570.    2.33MB    122.    567    44
#4 bind_rows   183.2µs 280.31µs     2226.    1.15MB     75.4   915    31
#5 unlist     199.81µs 248.43µs     3588.    1.16MB    118.   1243    41

n <- 25
bench::mark(exprs = M)
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 rbind       17.84ms  22.01ms      27.5   65.34MB     72.8    14    37
#2 rbind.fill   4.97ms    8.2ms      87.7   16.22MB     41.9    44    21
#3 rbindlist    1.56ms   3.34ms     226.    11.48MB     90.0   113    45
#4 bind_rows  820.99µs   1.46ms     441.     5.72MB     69.8   221    35
#5 unlist     927.46µs   1.23ms     502.     5.72MB     83.6   252    42

n <- 75
bench::mark(exprs = M)
# expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 rbind      63.23ms 68.66ms      11.3   273.7MB     47.0     7    29      618ms
#2 rbind.fill  17.1ms 20.96ms      43.2    48.7MB     31.4    22    16      509ms
#3 rbindlist    5.2ms  7.44ms     117.     34.4MB     57.7    59    29      503ms
#4 bind_rows   3.92ms  5.37ms     155.     17.2MB     37.7    78    19      505ms
#5 unlist       2.9ms  4.23ms     193.     17.2MB     47.4    98    24      507ms

In this case the method using unlist is the fastest and allocates the least memory. dplyr::bind_rows uses about as little memory and its timings are close to those of unlist.
I also noticed that when running the same benchmark on a roughly 10-year-old CPU, the gap between unlist and data.table::rbindlist gets larger.
Also have a look at More efficiency creating a new variable using for loop, where a different structure of the list passed to rbind makes do.call(rbind, LIST) about 10 times faster than dplyr::bind_rows(LIST).

GKi

Perhaps you could do SELECT COUNT(*) FROM table_name WHERE ... and then preallocate space for your data frame.

Actually, I don't think querying your database 10k rows at a time is a good idea. Try to avoid that by exporting the data to a local disk and reading it from there; that will improve the speed as well. Storage is cheap; network bandwidth and memory are not.
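
If exporting to disk is not an option, one middle ground is to send a single query and fetch the result set incrementally, so the server is not queried anew for every 10k rows. This is only a sketch, assuming a DBI-compatible driver (here odbc) with hypothetical connection details and table name:

library(DBI)
library(data.table)

#hypothetical connection details; adjust driver/server/database as needed
con <- dbConnect(odbc::odbc(), driver = "SQL Server",
                 server = "myserver", database = "mydb")

res <- dbSendQuery(con, "SELECT * FROM my_table")   #one query, fetched in pieces
chunks <- list()
i <- 1
while (!dbHasCompleted(res)) {
  chunks[[i]] <- dbFetch(res, n = 10000)            #next 10,000 rows
  i <- i + 1
}
dbClearResult(res)
dbDisconnect(con)

wannagetValue <- data.frame(rbindlist(chunks))      #bind all chunks once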

sf_