The core of the problem is the expansion of the values in the Key columns into sequences i = 1, 2, ..., value.
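OP's dataset1 is not reproduced in this answer. For readers who want to run the code, here is a hypothetical stand-in of the same shape; the column names Key2 to Key5 are an assumption (consistent with the - 1L offset in the ycw variant further down), the values are arbitrary, and the printed results in this answer stem from OP's actual data, so they will differ for this stand-in:
# hypothetical stand-in for OP's dataset1: 3 years, 4 Key count columns
set.seed(1L)
dataset1 <- data.frame(
  Year = 2001:2003,
  Key2 = sample(100:400, 3L),
  Key3 = sample(100:400, 3L),
  Key4 = sample(100:400, 3L),
  Key5 = sample(100:400, 3L)
)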
Here is another data.table solution employing melt() but differing in implementation details from David's comment:
library(data.table)
DT <- data.table(dataset1)
expanded <- melt(DT, id.vars = "Year", variable.name = "col")[
  , col := rleid(col)][                         # factor labels -> consecutive integer ids
  , .(i = seq_len(value)), by = .(Year, col)]   # expand each count value into the sequence 1:value
expanded
Year col i
1: 2001 1 1
2: 2001 1 2
3: 2001 1 3
4: 2001 1 4
5: 2001 1 5
---
2571: 2003 4 381
2572: 2003 4 382
2573: 2003 4 383
2574: 2003 4 384
2575: 2003 4 385
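A side note on rleid(): it assigns consecutive integer ids to runs of identical values, which here turns the factor labels produced by melt() into the column numbers 1 to 4. A minimal illustration:
rleid(factor(c("Key2", "Key2", "Key3", "Key5")))
#[1] 1 1 2 3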
The remaining computations can be done like this (if I've understood the OP's intention right):
set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df
Year col i p value
1: 2001 1 1 0.2875775 1.437888
2: 2001 1 2 0.7883051 3.941526
3: 2001 1 3 0.4089769 2.044885
4: 2001 1 4 0.8830174 4.415087
5: 2001 1 5 0.9404673 4.702336
---
2571: 2003 4 381 0.4711072 17.355536
2572: 2003 4 382 0.5323359 17.661680
2573: 2003 4 383 0.3953954 16.976977
2574: 2003 4 384 0.4544372 17.272186
2575: 2003 4 385 0.1149009 15.574505
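If that reading of the intention is correct, every value for column k must lie strictly between 5 * (k - 1) and 5 * k, because runif() returns values strictly between 0 and 1. A quick check of this assumption:
res.df[, all(value > 5 * (col - 1L) & value < 5 * col)]
# should return TRUE (p is strictly between 0 and 1)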
Benchmarking the different approaches
As the OP is asking for a faster / more efficient way, the three approaches proposed so far are benchmarked, with David's solution in two variants:
Benchmark code
For benchmarking, the microbenchmark package is used.
library(magrittr)
bm <- microbenchmark::microbenchmark(
  david1 = {
    expanded_david1 <-
      setorder(
        melt(DT, id.vars = "Year", value.name = "i", variable.name = "col")[rep(1:.N, i)],
        Year, col
      )[, i := seq_len(.N), by = .(Year, col)]
  },
  david2 = {
    expanded_david2 <-
      setorder(
        melt(DT, id.vars = "Year", value.name = "i", variable.name = "col")[
          , col := as.integer(col)][rep(1:.N, i)],
        Year, col
      )[, i := seq_len(.N), by = .(Year, col)]
  },
  uwe = {
    expanded_uwe <-
      melt(DT, id.vars = "Year", variable.name = "col")[, col := rleid(col)][
        , .(i = seq_len(value)), by = .(Year, col)]
  },
  ycw = {
    expanded_ycw <- DT %>%
      tidyr::gather(col, i, -Year) %>%
      dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
      dplyr::rowwise() %>%
      dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
      dplyr::select(Year, i, col) %>%
      dplyr::arrange(Year, col, i)
  },
  times = 100L
)
bm
Note that references to tidyverse functions are made explicit in order to avoid name conflicts due to a cluttered namespace. The modified david2 variant converts the factor column col to integer before replicating the rows.
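The timing distributions can also be inspected graphically; the microbenchmark package ships an autoplot method for its results (this assumes the ggplot2 package is installed):
ggplot2::autoplot(bm)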
Timing the small sample data set
With the small sample data set provided by the OP (3 years and 4 Key columns), the timings are as follows:
Unit: microseconds
expr min lq mean median uq max neval
david1 993.418 1161.4415 1260.4053 1244.320 1350.987 2000.805 100
david2 1261.500 1393.2760 1624.5298 1568.097 1703.837 5233.280 100
uwe 825.772 865.4175 979.2129 911.860 1084.226 1409.890 100
ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817 100
Even for this small problem size, the data.table solutions are orders of magnitude faster than the tidyverse approach, with a slight advantage for solution uwe.
The results are checked for equality:
all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE
Except for david1, which returns factors instead of integers, and the different row ordering of David's variants (adjusted in the checks above), all four results are identical.
Larger benchmark case
From the OP's code it can be concluded that his production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:
n_yr <- 10L     # years in the production data set
n_col <- 24L    # number of Key columns
avg_key <- 215L # overall mean of Key values in the sample data
col_names <- sprintf("Key%02i", 1L + seq_len(n_col))
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]  # every Key column filled with the average count
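The expected number of expanded rows follows directly from these parameters:
n_yr * n_col * avg_key
#[1] 51600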
The larger expanded data set has 51600 rows, which is still of rather moderate size but 20 times larger than the small sample. Timings are as follows (the benchmark was run with times = 5L for this case, as the neval column shows):
Unit: milliseconds
expr min lq mean median uq max neval
david1 2.512805 2.648735 2.726743 2.697065 2.698576 3.076535 5
david2 2.791838 2.816758 2.998828 3.068605 3.075780 3.241160 5
uwe 1.329088 1.453312 1.585390 1.514857 1.634551 1.995142 5
ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047 5
For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still orders of magnitude slower.