Full context of the problem can be found at https://github.com/ropensci/plotly/issues/981, but in this example I've tried to eliminate as much extraneous information as possible.

For the plotly package to build shape objects with grouping information from an input data frame, data, the first row of each group needs to be repeated at the end of that group to connect the edges of a shape, and an empty row needs to be added between groups so that edges are not connected between discrete groups. After manipulation, the output data, d, needs to have its class and attributes restored to match those of the input, data.
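For a concrete picture, here is a minimal toy illustration (base R only, not the package code) of the target output shape:

```r
## Toy illustration of the required reshaping, not the package code:
## within each group, repeat the first row at the end, then add an NA row.
df <- data.frame(x = c(1, 2, 3, 4), group = c("a", "a", "b", "b"),
                 stringsAsFactors = FALSE)
out <- do.call(rbind, lapply(split(df, df$group), function(g) {
  rbind(g, g[1, ], NA)  # retrace first row, then NA separator row
}))
out <- out[-nrow(out), ]  # drop the trailing NA row after the last group
```

This leaves the rows 1, 2, 1, NA, 3, 4, 3: each group closed back on its first row, with an NA row separating the groups.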

The package maintainer (Carson Sievert) identified this operation as one of the most time-consuming and memory-intensive steps of generating certain types of plots, and requested help optimizing it using C++ in place of the existing dplyr::do(data, dplyr::arrange_(., allVars)) and dplyr::do(data, rbind.data.frame(., .[1,], NA)) operations.

Since the majority of the time is spent subsetting and sorting rows, this looked like an application where the indexed keys and binary search used by data.table would yield a substantial improvement without even turning to C++.
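(As a minimal illustration of what I mean, not code from the package: a keyed data.table sorts once and then subsets on the key via binary search rather than a full vector scan.)

```r
library(data.table)

## Toy example: setkey() sorts the table once and stores the key,
## so subsequent subsets on the key column use binary search.
dt <- data.table(group = rep(letters[1:3], each = 2), vals = 1:6)
setkey(dt, group)
dt[.("b")]  # keyed (binary search) subset; same rows as dt[group == "b"]
```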

I started working on a data.table replacement based primarily on the answers to an existing question, Insert a row of NAs after each group of data using data.table, but I'm not getting the kind of performance improvement that would justify swapping out package dependencies. (Depending on the size of the data frame, number of groups, etc., this version runs only 1-10x faster than the existing function used in the package.)

library(data.table)
options(datatable.verbose=TRUE)

local_group2NA <- function(data, groupNames = "group", nested = NULL, ordered = NULL,
                     retrace.first = inherits(data, "GeomPolygon")) {

  ## store class information from function input
  retrace <- force(retrace.first)
  datClass <- class(data)

  allVars <- c(nested, groupNames, ordered)

  ## if retrace.first is TRUE, repeat the first row of each group and add an empty row of NA's after each group
  ## if retrace.first is FALSE, just add an empty row after each group
  d <- if (retrace.first) {
    data.table::setDT(data, key = allVars)[, index := .GRP, by = allVars][, .SD[c(1:(.N),1,(.N+1))], keyby = index][,index := NULL]
  } else {
    data.table::setDT(data, key = allVars)[, index := .GRP, by = allVars][, .SD[1:(.N+1)], keyby = index][,index := NULL]
  }

  ## delete last row if all NA's
  if (all(is.na(d[.N, ]))) d <- d[-.N,]

  ## return d with the original class
  structure(d, class = datClass)
}

Using the mtcars dataset yields the following output (with data.table verbose output):

> local_group2NA(mtcars, "vs", "cyl", retrace.first = TRUE)

forder took 0 sec
x is already ordered by these columns, no need to call reorder
Detected that j uses these columns: <none> 
Finding groups using uniqlist ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
Optimization is on but left j unchanged (single plain symbol): '.GRP'
Making each group and running j (GForce FALSE) ... 
  memcpy contiguous groups took 0.000s for 5 groups
  eval(j) took 0.000s for 5 calls
0 secs
Finding groups using forderv ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as '.SD[c(1:(.N), 1, (.N + 1))]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dogroups: growing from 32 to 57 rows
Wrote less rows (42) than allocated (57).

  memcpy contiguous groups took 0.000s for 5 groups
  eval(j) took 0.005s for 5 calls
0.004 secs
Detected that j uses these columns: index 
Assigning to all 42 rows
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
3    NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
4  21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
5  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
6  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
7  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
10 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
11 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
12 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
13 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
14 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
15   NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
16 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
17 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
18 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
19 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
20   NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
21 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
22 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
23 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
24 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
25 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
26   NA  NA    NA  NA   NA    NA    NA NA NA   NA   NA
27 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
28 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
29 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
30 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
31 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
32 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
33 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
34 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
35 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
36 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
37 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
38 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
39 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
40 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
41 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

One section of the verbose output in particular catches my eye as a potential explanation for the less-than-desired performance:

Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

Is there a better way that I could express j to eliminate the issue explained in the verbose message?

Benchmarking with the following code shows that the run time appears to increase proportionally with the size of the input data frame, which I would hope could be improved by one or more orders of magnitude with the right methods.

library(data.table)
options(datatable.verbose=FALSE)
library(microbenchmark)

exampleData <- function(nChunks = 100, nPerChunk = 100) {
  vals <- c(replicate(nChunks, rnorm(nPerChunk)))
  ids <- replicate(nChunks, basename(tempfile("")))
  ids <- rep(ids, each = nPerChunk)
  data.frame(vals, group = ids, stringsAsFactors = FALSE)
}


x1 <- exampleData(1e1)
x2 <- exampleData(1e2)
x3 <- exampleData(1e3)
x4 <- exampleData(1e4)
x5 <- exampleData(1e5)

res <- microbenchmark(
  local_group2NA(x1, retrace.first = TRUE),
  local_group2NA(x2, retrace.first = TRUE),
  local_group2NA(x3, retrace.first = TRUE),
  local_group2NA(x4, retrace.first = TRUE),
  local_group2NA(x5, retrace.first = TRUE),
  times = 1
)

res
Unit: milliseconds
                                     expr         min          lq        mean      median          uq         max neval
 local_group2NA(x1, retrace.first = TRUE)    41.12776    41.12776    41.12776    41.12776    41.12776    41.12776     1
 local_group2NA(x2, retrace.first = TRUE)    30.07690    30.07690    30.07690    30.07690    30.07690    30.07690     1
 local_group2NA(x3, retrace.first = TRUE)   270.07541   270.07541   270.07541   270.07541   270.07541   270.07541     1
 local_group2NA(x4, retrace.first = TRUE)  2779.03229  2779.03229  2779.03229  2779.03229  2779.03229  2779.03229     1
 local_group2NA(x5, retrace.first = TRUE) 28920.51861 28920.51861 28920.51861 28920.51861 28920.51861 28920.51861     1

In addition to optimizing the method I'm currently using, I would also appreciate suggestions of other methods to attain the same results faster (e.g. using rbindlist(), C++, etc.).
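For example, here is an untested sketch of what an rbindlist()-based version might look like (group2NA_rbindlist is a hypothetical name, and I have not benchmarked or integrated it):

```r
library(data.table)

## Hypothetical rbindlist() variant (a sketch, not benchmarked):
## build each group's piece (rows + retraced first row + all-NA row)
## and stack the pieces once with rbindlist().
group2NA_rbindlist <- function(data, groupNames = "group") {
  dt <- data.table::as.data.table(data)
  pieces <- lapply(split(dt, by = groupNames), function(g) {
    rbind(g, g[1L], g[NA_integer_])  # retrace first row, then NA separator
  })
  out <- data.table::rbindlist(pieces)
  out[-nrow(out)]  # drop the trailing NA row after the last group
}
```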

UPDATE:

Huge thanks to Frank; his comments below have this running an order of magnitude faster now. Running the same benchmark yields the following:

Unit: milliseconds
                                        expr         min          lq        mean      median          uq         max neval
 plotly:::group2NA(x1, retrace.first = TRUE)    3.106003    3.106003    3.106003    3.106003    3.106003    3.106003     1
 plotly:::group2NA(x2, retrace.first = TRUE)    4.583826    4.583826    4.583826    4.583826    4.583826    4.583826     1
 plotly:::group2NA(x3, retrace.first = TRUE)   10.821644   10.821644   10.821644   10.821644   10.821644   10.821644     1
 plotly:::group2NA(x4, retrace.first = TRUE)   93.619315   93.619315   93.619315   93.619315   93.619315   93.619315     1
 plotly:::group2NA(x5, retrace.first = TRUE) 1195.372013 1195.372013 1195.372013 1195.372013 1195.372013 1195.372013     1

Updated function:

local_group2NA <- function(data, groupNames = "group", nested = NULL, ordered = NULL,
                     retrace.first = inherits(data, "GeomPolygon")) {

  ## store class information from function input
  retrace <- force(retrace.first)
  datClass <- class(data)

  allVars <- c(nested, groupNames, ordered)

  ## if retrace.first is TRUE, repeat the first row of each group and add an empty row of NA's after each group
  ## if retrace.first is FALSE, just add an empty row after each group
  d <- if (retrace.first) {
    data.table::setDT(data, key = allVars)[ data[, .I[c(seq_along(.I), 1L, .N+1L)], by=allVars]$V1 ]
  } else {
    data.table::setDT(data, key = allVars)[ data[, .I[c(seq_along(.I), .N+1L)], by=allVars]$V1 ]
  }

  ## delete last row if all NA's
  if (all(is.na(d[.N, ]))) d <- d[-.N,]

  ## return d with the original class
  structure(d, class = datClass)
}
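As a quick sanity check on toy data (not part of the package tests), the .I-based indexing selects the same rows as the original .SD-based j:

```r
library(data.table)

## Toy check that .I-based row indexing matches the .SD-based version.
dt <- data.table(vals = 1:6, group = rep(c("a", "b"), each = 3), key = "group")
sd_way <- dt[, .SD[c(1:.N, 1L, .N + 1L)], by = group]
i_way  <- dt[dt[, .I[c(seq_along(.I), 1L, .N + 1L)], by = group]$V1]
identical(sd_way$vals, i_way$vals)  # TRUE
```

Note that the grouping column itself differs on the NA separator rows (by= fills it in the .SD version, while NA indexing leaves it NA), which is why the check compares vals only.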

Also, for reference, the verbose output for the mtcars example above with the updated function is:

forder took 0 sec
reorder took 0.006 sec
Detected that j uses these columns: <none> 
Finding groups using uniqlist ... 0.001 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as '.I[c(seq_along(.I), 1L, .N + 1L)]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... dogroups: growing from 32 to 57 rows
Wrote less rows (42) than allocated (57).

  memcpy contiguous groups took 0.000s for 5 groups
  eval(j) took 0.000s for 5 calls
0.001 secs
Matt Summersgill
  • Don't put generation of example data inside the benchmark, since it also is likely to take a good chunk of time. Also, please make a minimal example next time. We don't need to know about your three things that go into grouping. Anyways, this seems a lot faster: `f <- function(d, g = "group") d[ d[, .I[c(seq_along(.I), 1L, .N+1L)], by=g]$V1 ]` taken from http://stackoverflow.com/questions/16573995/subset-by-group-with-data-table/16574176#16574176 – Frank May 09 '17 at 16:58
  • Thanks for the benchmark advice, feeling a little sheepish on that one, updated code and benchmark results following that. Trying out your recommendation now! – Matt Summersgill May 09 '17 at 17:11
  • Need to do some additional testing to make sure it works as expected when integrated into the rest of the package, but your suggestion just knocked that last group in the benchmark from 28290 ms down to 1207 ms! – Matt Summersgill May 09 '17 at 17:26
  • related: http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table – MichaelChirico May 12 '17 at 15:22

0 Answers