In this question, the data.table package creator explains why rows cannot yet be inserted (or removed) by reference in the middle of a data.table. He also points out that such operations could be possible at the end of the table. Could you show code to perform this action? It would be the "by reference" version of

> a <- data.table(id=letters[1:2], var=1:2)
> a
   id var
1:  a   1
2:  b   2
> rbind(a, data.table(id="c", var=3))
   id var
1:  a   1
2:  b   2
3:  c   3

Thanks.

EDIT:

Since a proper solution is not possible yet, which of the following is better in terms of speed and memory usage (assuming they differ internally, which I am not sure of)?

rbind(a, data.table(id="c", var=3))

rbindlist(list(a,  data.table(id="c", var=3)))

Are there any other (better) methods?

Michele
  • 4
    Sorry, not yet implemented. The words "could be" and "would be" in that answer are meant with future tense. – Matt Dowle May 28 '13 at 13:00
  • @MatthewDowle Hi. I see, you meant the speed (compared to SQL) and said "could be inserted (and deleted) at the end, **instantly**" as a possibility... sorry my fault. I'll make a change to the question to give it more sense then. – Michele May 28 '13 at 13:11
  • Is there any way to add such a row in-line during DT processing, like below: DT[,transformation][,transformation2][,transformation3][,transformation4,by='abc'][add_grand_total_summary_row]? It is easy to put a grand total in an additional column, but it is not so elegant. – jangorecki Dec 15 '13 at 00:26
  • 1
    @MattDowle can you at least add a [tracking FR](https://github.com/Rdatatable/data.table/issues) so we can check back if there's a version commitment? – smci Apr 03 '15 at 09:43
  • 2
    Check out the benchmark of [4 different methods of appending multiple rows in place when the number of rows is not known in advance](http://stackoverflow.com/questions/20689650/how-to-append-rows-to-an-r-data-frame/38052208#38052208). – Adam Ryczkowski Jun 27 '16 at 17:33

1 Answer

To answer your edit, just run a benchmark:

library(data.table)
library(microbenchmark)

a = data.table(id=letters[1:2], var=1:2)
b = copy(a)
c = copy(b) # let's also just try modifying same value in place
            # to see how well changing existing values does
microbenchmark(a <- rbind(a, data.table(id="c", var=3)),
               b <- rbindlist(list(b,  data.table(id="c", var=3))),
               c[1, var := 3L],
               set(c, 1L, 2L, 3L))
#Unit: microseconds
#                                                  expr     min        lq    median        uq      max neval
#          a <- rbind(a, data.table(id = "c", var = 3)) 865.460 1141.2585 1357.1230 1539.4300 6814.492   100
#b <- rbindlist(list(b, data.table(id = "c", var = 3))) 260.440  325.3835  445.4190  522.8825 1143.930   100
#                                   c[1, `:=`(var, 3L)] 482.147  626.5570  778.3135  904.3595 1109.539   100
#                                    set(c, 1L, 2L, 3L)   2.339    5.677    7.5140    9.5170   19.033   100

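rbindlist is clearly better than rbind. Thanks to Matthew Dowle for pointing out the problems with using [ in a loop, I added another benchmark with set. Note that the := and set lines do not append a row; they update an existing value in place, to show how cheap that is by comparison.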
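
If several rows have to be appended, one way to keep the number of binds down is to collect the pieces in a plain list and call rbindlist once at the end. This is only a minimal sketch: the names pieces and combined are made up for illustration, and it assumes the new rows can be held in memory as separate small tables until the final bind.

```r
library(data.table)

a <- data.table(id = letters[1:2], var = 1:2)

# Hypothetical container: slot 1 holds the existing table,
# slots 2..4 will hold the rows to append.
pieces <- vector("list", 4L)
pieces[[1L]] <- a

for (k in 1:3) {
  # Each new row is a one-row data.table with the same column types as `a`.
  pieces[[k + 1L]] <- data.table(id = letters[k + 2L], var = k + 2L)
}

# A single bind at the end instead of one bind per appended row.
combined <- rbindlist(pieces)
```

This still allocates a new table, so it is not an append by reference; it only avoids re-copying the growing table on every iteration.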

From the above, your best options are to use rbindlist, or to size the data.table up front and then just populate the values. If you don't know the size of the data to begin with, you can use a strategy similar to std::vector in C++: double the size every time you run out of space, and delete the extra rows once you're done filling it in.
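
The preallocate-and-double strategy could look roughly like the following. This is a sketch under stated assumptions, not a data.table feature: the variables capacity, n, pad and result are invented for the example, and the column layout (id, var) is taken from the question. Rows are filled by reference with set(); only the doubling step copies the table.

```r
library(data.table)

# Assumed starting capacity; unused rows are filled with NA.
capacity <- 2L
dt <- data.table(id  = rep(NA_character_, capacity),
                 var = rep(NA_integer_, capacity))
n <- 0L  # number of rows actually in use

for (k in 1:5) {
  if (n == nrow(dt)) {
    # Out of space: double the capacity. This is the only step that copies.
    pad <- data.table(id  = rep(NA_character_, nrow(dt)),
                      var = rep(NA_integer_, nrow(dt)))
    dt <- rbindlist(list(dt, pad))
  }
  n <- n + 1L
  # Fill the next free row by reference.
  set(dt, i = n, j = "id",  value = letters[k])
  set(dt, i = n, j = "var", value = k)
}

# Drop the unused preallocated rows once filling is done.
result <- dt[seq_len(n)]
```

With doubling, the number of copies grows with the logarithm of the final row count rather than with the row count itself, which is the point of borrowing the std::vector idea.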

eddi
  • 4
    Nice. On the surprise that's probably the overhead of calling `[.data.table` many times, since `microbenchmark` calls it 100 times in this example. Try `set()` instead for a loopable `:=`. – Matt Dowle May 28 '13 at 17:11
  • And if you set `id` as the key of `c` and change `c[1, var := 3L]` to `c["a", var := 3L]`, this one is even slower than the first. Anyway, thanks a lot. I could have done it myself, I know, but I'm more than new to `data.table` and I wanted to get the most from the question (e.g. I didn't know `copy`!) – Michele May 28 '13 at 17:20
  • @Matthew Dowle 25% faster than `rbindlist` – Michele May 28 '13 at 17:23
  • Don't know if the functionality of `set()` has changed over the years but the current code is not appending a row to the original data but updating a single value... don't think it can add rows in 2019. – s_baldur Feb 04 '19 at 10:37
  • @snoram it didn't in 2013 either :) You can look at the history of edits to understand why `set` was added to benchmark - basically it provides a lower bound. – eddi Feb 04 '19 at 17:01