I ran each solution on my data set and compared the run times with rbenchmark.
I cannot share the data set, but here is some basic info:
dim(event_source_causal_parts)
[1] 311127 4
The code for the comparison:
library(rbenchmark)
benchmark({
event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)]
setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(c(0,diff(Source == "Warranty")) < 0))][Source != 'Warranty', prior := '']
})
benchmark({
event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)]
setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(shift(Source, fill="Warranty") == "Warranty"))][Source != 'Warranty', prior := '']
})
benchmark({
event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)]
indx <- setDT(event_source_causal_parts)[, list(.I[.N], paste(Causal_Part_Number[-.N], collapse = " ")),
by = list(c(0L, cumsum(Source == "Warranty")[-nrow(event_source_causal_parts)]))]
})
The results are as follows:
  replications elapsed relative user.self sys.self user.child sys.child
1          100   12.91        1     12.76     0.05         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100    12.7        1     12.66     0.05         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100   61.97        1     61.65        0         NA        NA
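Because each benchmark() call above times a single expression, the relative column is always 1. A minimal sketch of an alternative, assuming a made-up stand-in for the private data set: pass all three expressions as named arguments to one benchmark() call so that relative compares them directly. The table `toy`, its values, the labels `diff_group`, `shift_group` and `index_only`, and the replication count are all hypothetical choices for illustration, not part of the original test.

```r
library(data.table)
library(rbenchmark)

# Hypothetical stand-in for the private data: same column names, invented values.
set.seed(1)
n <- 1000L
toy <- data.table(
  PROD_ID = sample(1:50, n, replace = TRUE),
  Source = sample(c("Warranty", "Service"), n, replace = TRUE),
  Event_Date = as.Date("2014-01-01") + sample(0:364, n, replace = TRUE),
  Causal_Part_Number = sprintf("P%04d", sample(1:200, n, replace = TRUE))
)

# One call, three named expressions. Each starts from a fresh copy because
# := modifies a data.table by reference.
res <- benchmark(
  diff_group = {
    dt <- copy(toy)
    dt[, prior := paste(Causal_Part_Number[-.N], collapse = " "),
       by = .(group = cumsum(c(0, diff(Source == "Warranty")) < 0))][
       Source != "Warranty", prior := ""]
  },
  shift_group = {
    dt <- copy(toy)
    dt[, prior := paste(Causal_Part_Number[-.N], collapse = " "),
       by = .(group = cumsum(shift(Source, fill = "Warranty") == "Warranty"))][
       Source != "Warranty", prior := ""]
  },
  index_only = {
    dt <- copy(toy)
    indx <- dt[, list(.I[.N], paste(Causal_Part_Number[-.N], collapse = " ")),
               by = list(c(0L, cumsum(Source == "Warranty")[-nrow(dt)]))]
  },
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative"),
  order = "elapsed"
)
print(res)
```

With order = "elapsed" the fastest expression is listed first, and relative gives each run time as a multiple of the fastest. Timings on this toy table say nothing about the real data set; only the layout of the comparison carries over.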
My environment:
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_1.0.0 stringr_0.6.2 data.table_1.9.5 vimcom_1.2-6
loaded via a namespace (and not attached):
[1] chron_2.3-45 grid_3.1.2 lattice_0.20-30 tools_3.1.2 zoo_1.7-11
R used the Intel MKL math libraries.
Based on these results, I think that @akrun's second solution is the fastest, although the margin over the next-best run is small (12.70 s vs. 12.91 s for 100 replications).
I ran the test again after recompiling data.table with -O3 and updating R to 3.2.0. The results are very different:
  replications elapsed relative user.self sys.self user.child sys.child
1          100   21.22        1     20.73     0.48         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100   11.31        1     10.39     0.92         NA        NA
  replications elapsed relative user.self sys.self user.child sys.child
1          100   35.77        1     35.53     0.25         NA        NA
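So the best solution is even faster under the new R with -O3 (11.31 s vs. 12.70 s), but the second-best solution is much slower (21.22 s vs. 12.91 s). The third solution improves from 61.97 s to 35.77 s but is still the slowest.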
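To put numbers on the change, here is a small sketch that divides the elapsed times of the second run by those of the first. The elapsed values are copied from the outputs above; the labels `diff`, `shift` and `index` are my own shorthand for the three benchmarked expressions, in the order they were run.

```r
# Elapsed seconds for 100 replications, copied from the two sets of results.
elapsed <- data.frame(
  solution = c("diff", "shift", "index"),  # hypothetical labels, in run order
  before = c(12.91, 12.70, 61.97),         # R 3.1.2
  after = c(21.22, 11.31, 35.77)           # R 3.2.0, data.table built with -O3
)

# Ratio below 1 means faster after the upgrade, above 1 means slower.
elapsed$ratio <- round(elapsed$after / elapsed$before, 2)
print(elapsed)
```

The ratios come out at 1.64, 0.89 and 0.58, so the first expression took about 64% longer, the second about 11% less time, and the third about 42% less time.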