Full context of the problem can be found at https://github.com/ropensci/plotly/issues/981, but in this example I've tried to eliminate as much extraneous information as possible.
For the plotly package to build shape objects with grouping information from an input data frame, data, the first row of each group needs to be repeated at the end of that group to connect the edges of a shape, and an empty row of NAs needs to be added between groups so that edges are not connected across discrete groups. After manipulation, the output data, d, needs to have its class and attributes restored to match those of the input, data.
The package maintainer (Carson Sievert) identified this operation as one of the most time-consuming and memory-intensive steps of generating certain types of plots, and requested help optimizing it using C++ in place of the existing dplyr::do(data, dplyr::arrange_(., allVars))
and dplyr::do(data, rbind.data.frame(., .[1,], NA))
operations.
Since the majority of the time is spent subsetting and sorting rows, this looked like an application where the indexed keys and binary search used by data.table would yield a substantial improvement without even turning to C++.
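As a minimal illustration of the keyed-subsetting idea (toy data, not from the question): once a key is set, data.table physically sorts the table and can locate groups by binary search rather than scanning every row.

```r
library(data.table)

dt <- data.table(id = c("b", "a", "a", "c"), x = 1:4)
setkey(dt, id)   # sorts by id and marks it as the key
dt["a"]          # binary search on the key, not a linear scan
```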
I started working on a data.table replacement based primarily on the answers to an existing question: Insert a row of NAs after each group of data using data.table, but I'm not getting the kind of performance improvement that would justify swapping out package dependencies. (Depending on the size of the data frame, the number of groups, etc., this version only runs 1-10x faster than the existing function used in the package.)
library(data.table)
options(datatable.verbose=TRUE)
local_group2NA <- function(data, groupNames = "group", nested = NULL, ordered = NULL,
retrace.first = inherits(data, "GeomPolygon")) {
## store class information from function input
retrace <- force(retrace.first)
datClass <- class(data)
allVars <- c(nested, groupNames, ordered)
## if retrace.first is TRUE, repeat the first row of each group and add an empty row of NAs after each group
## if retrace.first is FALSE, just add an empty row after each group
d <- if (retrace.first) {
data.table::setDT(data, key = allVars)[, index := .GRP, by = allVars][, .SD[c(1:(.N),1,(.N+1))], keyby = index][,index := NULL]
} else {
data.table::setDT(data, key = allVars)[, index := .GRP, by = allVars][, .SD[1:(.N+1)], keyby = index][,index := NULL]
}
## delete last row if all NA's
if (all(is.na(d[.N, ]))) d <- d[-.N,]
## return d with the original class
structure(d, class = datClass)
}
Using the mtcars dataset yields the following (with data.table verbose output):
> local_group2NA(mtcars, "vs","cyl",retrace.first = TRUE)
forder took 0 sec
x is already ordered by these columns, no need to call reorder
Detected that j uses these columns: <none>
Finding groups using uniqlist ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
Optimization is on but left j unchanged (single plain symbol): '.GRP'
Making each group and running j (GForce FALSE) ...
memcpy contiguous groups took 0.000s for 5 groups
eval(j) took 0.000s for 5 calls
0 secs
Finding groups using forderv ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as '.SD[c(1:(.N), 1, (.N + 1))]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dogroups: growing from 32 to 57 rows
Wrote less rows (42) than allocated (57).
memcpy contiguous groups took 0.000s for 5 groups
eval(j) took 0.005s for 5 calls
0.004 secs
Detected that j uses these columns: index
Assigning to all 42 rows
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
3 NA NA NA NA NA NA NA NA NA NA NA
4 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
5 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
6 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
7 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
10 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
11 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
12 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
13 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
14 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
15 NA NA NA NA NA NA NA NA NA NA NA
16 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
17 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
18 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
19 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
20 NA NA NA NA NA NA NA NA NA NA NA
21 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
22 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
23 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
24 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
25 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
26 NA NA NA NA NA NA NA NA NA NA NA
27 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
28 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
29 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
30 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
31 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
32 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
33 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
34 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
35 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
36 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
37 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
38 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
39 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
40 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
41 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
One section of the verbose output in particular catches my eye as a potential explanation for the less-than-desired performance:
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Is there a better way to express j that would eliminate the issue described in the verbose message?
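For reference, one pattern that avoids materializing a named list (.SD) per group is to have j return row indices via .I and subset the whole table once; indexing one past the group's length (.N + 1L) produces NA_integer_, which subsets to a row of NAs. A toy sketch (the column names g and x are illustrative, not from the question):

```r
library(data.table)

dt <- data.table(g = rep(c("a", "b"), each = 3), x = 1:6)
## j returns row numbers per group: all rows, the first row again,
## and an out-of-range index (.N + 1L) that subsets to a row of NAs
idx <- dt[, .I[c(seq_along(.I), 1L, .N + 1L)], by = g]$V1
dt[idx]
```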
Benchmarking with the following steps shows that the run-time appears to be increasing proportionally with the size of the input data frame, which I would hope could be improved by one or more orders of magnitude with the right methods.
library(data.table)
options(datatable.verbose=FALSE)
library(microbenchmark)
exampleData <- function(nChunks = 100, nPerChunk = 100) {
vals <- c(replicate(nChunks, rnorm(nPerChunk)))
ids <- replicate(nChunks, basename(tempfile("")))
ids <- rep(ids, each = nPerChunk)
data.frame(vals, group = ids, stringsAsFactors = FALSE)
}
x1 <- exampleData(1e1)
x2 <- exampleData(1e2)
x3 <- exampleData(1e3)
x4 <- exampleData(1e4)
x5 <- exampleData(1e5)
res <- microbenchmark(
local_group2NA(x1, retrace.first = TRUE),
local_group2NA(x2, retrace.first = TRUE),
local_group2NA(x3, retrace.first = TRUE),
local_group2NA(x4, retrace.first = TRUE),
local_group2NA(x5, retrace.first = TRUE),
times = 1
)
res
Unit: milliseconds
expr min lq mean median uq max neval
local_group2NA(x1, retrace.first = TRUE) 41.12776 41.12776 41.12776 41.12776 41.12776 41.12776 1
local_group2NA(x2, retrace.first = TRUE) 30.07690 30.07690 30.07690 30.07690 30.07690 30.07690 1
local_group2NA(x3, retrace.first = TRUE) 270.07541 270.07541 270.07541 270.07541 270.07541 270.07541 1
local_group2NA(x4, retrace.first = TRUE) 2779.03229 2779.03229 2779.03229 2779.03229 2779.03229 2779.03229 1
local_group2NA(x5, retrace.first = TRUE) 28920.51861 28920.51861 28920.51861 28920.51861 28920.51861 28920.51861 1
In addition to optimizing the method I'm currently using, I would also appreciate suggestions for other approaches that attain the same result faster (e.g. using rbindlist(), C++, etc.).
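As one possible shape for the rbindlist() route (a sketch on toy data, not benchmarked here): build each group's block with the retraced first row plus an NA row, then bind all the blocks once at the end. Subsetting a data.table with NA_integer_ yields a row of NAs, and the trailing NA row from the last group can be dropped with the same final check as above.

```r
library(data.table)

dt <- data.table(g = rep(c("a", "b"), each = 3), x = 1:6)
## one block per group: all rows, the first row again, then an NA row
blocks <- lapply(split(dt, by = "g"),
                 function(sd) sd[c(seq_len(nrow(sd)), 1L, NA_integer_)])
rbindlist(blocks)
```

The per-group copies made by split() and lapply() may well make this slower than a single indexed subset, but it shows the intent of the operation clearly.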
UPDATE:
Huge thanks to Frank; his comments below have this running an order of magnitude faster. Running the same benchmark yields the following:
Unit: milliseconds
expr min lq mean median uq max neval
plotly:::group2NA(x1, retrace.first = TRUE) 3.106003 3.106003 3.106003 3.106003 3.106003 3.106003 1
plotly:::group2NA(x2, retrace.first = TRUE) 4.583826 4.583826 4.583826 4.583826 4.583826 4.583826 1
plotly:::group2NA(x3, retrace.first = TRUE) 10.821644 10.821644 10.821644 10.821644 10.821644 10.821644 1
plotly:::group2NA(x4, retrace.first = TRUE) 93.619315 93.619315 93.619315 93.619315 93.619315 93.619315 1
plotly:::group2NA(x5, retrace.first = TRUE) 1195.372013 1195.372013 1195.372013 1195.372013 1195.372013 1195.372013 1
Updated function:
local_group2NA <- function(data, groupNames = "group", nested = NULL, ordered = NULL,
retrace.first = inherits(data, "GeomPolygon")) {
## store class information from function input
retrace <- force(retrace.first)
datClass <- class(data)
allVars <- c(nested, groupNames, ordered)
## if retrace.first is TRUE, repeat the first row of each group and add an empty row of NAs after each group
## if retrace.first is FALSE, just add an empty row after each group
d <- if (retrace.first) {
data.table::setDT(data, key = allVars)[ data[, .I[c(seq_along(.I), 1L, .N+1L)], by=allVars]$V1 ]
} else {
data.table::setDT(data, key = allVars)[ data[, .I[c(seq_along(.I), .N+1L)], by=allVars]$V1 ]
}
## delete last row if all NA's
if (all(is.na(d[.N, ]))) d <- d[-.N,]
## return d with the original class
structure(d, class = datClass)
}
Also, for reference, the verbose output for the mtcars example above with the updated function is:
forder took 0 sec
reorder took 0.006 sec
Detected that j uses these columns: <none>
Finding groups using uniqlist ... 0.001 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as '.I[c(seq_along(.I), 1L, .N + 1L)]'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... dogroups: growing from 32 to 57 rows
Wrote less rows (42) than allocated (57).
memcpy contiguous groups took 0.000s for 5 groups
eval(j) took 0.000s for 5 calls
0.001 secs