
I am trying to expand a dataset in R using the values from different columns (Key2 - KeyX) and then use the column number in a formula to compute some value.

Example of a part of the dataset I want to expand

Year Key2 Key3 Key4 Key5 ...
2001  150  105  140  140
2002  130   70   55   80
2003  590  375  355  385
...

Preferred result.

  • i = index number
  • col = column number (Key2 = 1, Key3 = 2, etc.)
  • p = random number
  • value = value calculated from the column number and p (in the code below: value = (col - 1) * 5 + 5 * p)

    year   i col         p     value
    2001   1   1 0.7481282 4.0150810
    2001   2   1 0.8449366 2.0735090
    2001 ...   1 0.1906882 0.9534411
    2001 150   1 0.8030162 3.7406410
    2001   1   2 0.4147019 4.2246831
    2001   2   2 0.3716995 1.8584977
    2001 ...   2 0.5280272 2.6401361
    2001 105   2 0.8030162 3.7406410
    2001   1   3 0.7651376 3.8256881
    2001   2   3 0.2298984 1.1494923
    2001 ...   3 0.5607825 2.8039128
    2001 140   3 0.7222644 3.6113222
    etc.
    
    2002   1   1 0.1796613 0.8983065
    2002   2   1 0.6390833 3.1954165
    2002 ...   1 0.5280272 2.6401367
    2002 130   1 0.4238842 2.1194210
    2002   1   2 0.7651376 3.8256889
    2002   2   2 0.2298984 1.1494928
    2002 ...   2 0.5607825 2.8039125
    2002  70   2 0.7222644 3.6113227
    2002   1   3 0.7512801 3.7564000
    2002   2   3 0.4484248 2.2421240
    2002 ...   3 0.5662704 2.8313520
    2002  55   3 0.7685377 3.8426884
    etc.
    

I use the following code in R, but it is very slow with a large dataset. I tried to keep the use of loops to a minimum by using rep(), but I still have two for-loops in the code.

Is there a faster / more efficient way to do this in R? Using data.table?

library(data.table)  # for setDT()

val <- c(); i <- c(); cols <- c(); p <- c(); year <- c()
for (y in 1:10) {
  for (n in 2:25) {
      cl <- n - 1
      pu <- runif(dataset1[[y, n]])  # one uniform draw per person in this cell
      p <- c(p, pu)
      tmp <- (cl - 1) * 5 + 5 * pu   # value computed from column number and p
      val <- c(val, tmp)
      ##
      i <- c(i, 1:dataset1[[y, n]])
      cols <- c(cols, rep(cl, dataset1[[y, n]]))
      year <- c(year, rep(dataset1[[y, 1]], dataset1[[y, n]]))
  }
}
res.df <- data.frame(year = year, i = i, cols = cols, p = p, val = val)
res.df <- setDT(res.df)
Steve
  • You really really need to vectorize your code. You can easily create everything up to the `p` and `value` stage in a vectorized manner. For instance `library(data.table) ; res <- setorder(melt(setDT(DT), id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]`. I don't understand how you are creating `p` and `value`, but `runif` is **vectorized**; **don't use loops** in order to run vectorized actions. You can create both of these columns in a single call if you could explain what on earth you are actually doing. – David Arenburg Jun 25 '17 at 14:11
  • Thanks for the `data.table` suggestion. So much faster than loops. Some more background information. The values in the initial dataset represent the number of people in a certain age group with a range of 5 years. Key2: age 0-5 yrs; Key3: age 5-10 yrs; etc. With the formula, I try to simulate the ages for the individuals in each category. – Steve Jun 26 '17 at 15:22
  • I would like to add this question as well, as a reference: https://stackoverflow.com/questions/40175658/efficient-way-of-appending-rows-to-data-table Perhaps you could speed up some of the calculations by only expanding the data.table if there are no more empty rows available. – hannes101 Jun 27 '17 at 14:23
  • Possible duplicate of [Add a row by reference at the end of a data.table object](https://stackoverflow.com/questions/16792001/add-a-row-by-reference-at-the-end-of-a-data-table-object) – hannes101 Jun 27 '17 at 14:25
  • @hannes101 I don't believe this is a good dupe target. The question is _not_ about adding rows at the end of a data.table but to create a substantially larger data set, i.e., the above 3 rows are eventually expanded to a new data set of 2575 rows which is a 900-fold increase. However, using `data.table`'s fast `set()` operation might be an alternative to the other answers. If you post a solution I'll be happy to include this in my benchmark. – Uwe Jun 28 '17 at 06:14

2 Answers


The core of the problem is the expansion of the values in the Key columns into i.

Here is another data.table solution employing melt() but differing in implementation details from David's comment:

library(data.table)
DT <- data.table(dataset1)
expanded <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
  , .(i = seq_len(value)), by = .(Year, col)]
expanded
      Year col   i
   1: 2001   1   1
   2: 2001   1   2
   3: 2001   1   3
   4: 2001   1   4
   5: 2001   1   5
  ---             
2571: 2003   4 381
2572: 2003   4 382
2573: 2003   4 383
2574: 2003   4 384
2575: 2003   4 385
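
As a quick sanity check, the group sizes of expanded should reproduce the Key values of the input, e.g. 150, 105, 140, 140 for 2001:

expanded[, .N, by = .(Year, col)]
# N equals the original Key count of each Year/col cell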

The remaining computations can be done like this (if I've understood the OP's intention right):

set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df
      Year col   i         p     value
   1: 2001   1   1 0.2875775  1.437888
   2: 2001   1   2 0.7883051  3.941526
   3: 2001   1   3 0.4089769  2.044885
   4: 2001   1   4 0.8830174  4.415087
   5: 2001   1   5 0.9404673  4.702336
  ---                                 
2571: 2003   4 381 0.4711072 17.355536
2572: 2003   4 382 0.5323359 17.661680
2573: 2003   4 383 0.3953954 16.976977
2574: 2003   4 384 0.4544372 17.272186
2575: 2003   4 385 0.1149009 15.574505
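
Since runif() never returns the extremes 0 or 1, each value lies strictly inside the 5-year band of its column, which can be verified directly:

# every value must fall within its column's band (5 * (col - 1), 5 * col)
res.df[, all(value > 5 * (col - 1) & value < 5 * col)]
#[1] TRUE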

Benchmarking the different approaches

As the OP is asking for a faster / more efficient way, the three different approaches proposed so far are benchmarked, with David's suggestion included in two variants:

Benchmark code

For benchmarking, the microbenchmark package is used.

library(magrittr)
bm <- microbenchmark::microbenchmark(
  david1 = {
    expanded_david1 <-
      setorder(
        melt(DT, id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col
      )[, i := seq_len(.N), by = .(Year, col)]
  },
  david2 = {
    expanded_david2 <-
      setorder(
        melt(DT, id = "Year", value = "i", variable = "col")[, col := as.integer(col)][
          rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]
  },
  uwe = {
    expanded_uwe <- 
      melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
        , .(i = seq_len(value)), by = .(Year, col)]
  },
  ycw = {
    expanded_ycw <- DT %>%
      tidyr::gather(col, i, - Year) %>%
      dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
      dplyr::rowwise() %>%
      dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
      dplyr::select(Year, i, col) %>%
      dplyr::arrange(Year, col, i)
  },
  times = 100L
)
bm

Note that references to tidyverse functions are made explicit in order to avoid name conflicts due to a cluttered name space. The modified david2 variant converts the factor column col to its integer level codes.

Timing the small sample data set

With the small sample data set of 3 years and 4 Key columns provided by the OP, the timings are as follows:

Unit: microseconds
   expr       min         lq        mean    median         uq        max neval
 david1   993.418  1161.4415   1260.4053  1244.320   1350.987   2000.805   100
 david2  1261.500  1393.2760   1624.5298  1568.097   1703.837   5233.280   100
    uwe   825.772   865.4175    979.2129   911.860   1084.226   1409.890   100
    ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817   100

Even for this small problem size, the data.table solutions are orders of magnitude faster than the tidyverse approach, with a slight advantage for solution uwe.

The results are checked to be equal:

all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE

Except for david1, which returns col as a factor instead of an integer and in a different row order, all four results are identical.

Larger benchmark case

From the OP's code it can be concluded that his production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:

n_yr <- 10L
n_col <- 24L
avg_key <- 215L
col_names <- sprintf("Key%02i", 1L + seq_len(n_col))
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]
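
Assuming each cell holds avg_key persons, the expanded result will contain one row per person, i.e. n_yr * n_col * avg_key rows:

n_yr * n_col * avg_key
#[1] 51600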

The larger data set expands to 51600 rows, which is still of rather moderate size but 20 times larger than the small sample. Timings are as follows:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 david1    2.512805    2.648735    2.726743    2.697065    2.698576    3.076535     5
 david2    2.791838    2.816758    2.998828    3.068605    3.075780    3.241160     5
    uwe    1.329088    1.453312    1.585390    1.514857    1.634551    1.995142     5
    ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047     5

For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still orders of magnitude slower.

Uwe
  • Great answer and comparison. – www Jun 26 '17 at 12:13
  • Thanks for this great answer and clear explanation. This code runs so much faster than my original code. – Steve Jun 26 '17 at 15:25
  • Hi @UweBlock, I have another approach from `tidyverse` using `map2` and `unnest`. Please see the update in my answer. Would you mind also benchmarking this one? There is no doubt that `data.table` is faster, but I think it would be good to know if it is a good rule to avoid `rowwise` and `do` if possible. – www Jun 26 '17 at 15:27
  • @ycw Yes, you're welcome. My idea is to add benchmark runs for larger problem cases by varying the 3 parameters which influence the problem size. – Uwe Jun 26 '17 at 15:30

Here is one idea. dt2 contains the expanded Year, col, and i. You can further create p and value for dt2; a sketch follows after the code.

# Load package
library(tidyverse)

# Create example data frame
dt <- read.table(text = "Year Key2 Key3 Key4 Key5
2001  150  105  140  140
2002  130   70   55   80
2003  590  375  355  385",
                 header = TRUE, stringsAsFactors = FALSE)


# Expand the data frame
dt2 <- dt %>%
  gather(col, i, - Year) %>%
  mutate(col = as.numeric(sub("Key", "", col)) - 1) %>%
  rowwise() %>%
  do(data_frame(Year = .$Year, col = .$col, i = seq(1, .$i, 1))) %>%
  select(Year, i, col) %>%
  arrange(Year, col, i)
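
A minimal sketch of the remaining step, assuming the formula value = 5 * (col - 1 + p) from the data.table answer above (ungroup() guards against any residual grouping from rowwise()):

# Add p and value to the expanded data frame
set.seed(123)
dt2 <- dt2 %>%
  ungroup() %>%
  mutate(p = runif(n()),
         value = 5 * (col - 1 + p))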

Update

Another approach from the tidyverse.

# Expand the data frame
dt2 <- dt %>%
  gather(col, i, - Year) %>%
  mutate(col = as.numeric(sub("Key", "", col)) - 1) %>%
  mutate(i = map2(1L, i, seq, by = 1)) %>%
  unnest() %>%
  select(Year, i, col) %>%
  arrange(Year, col, i)
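
Here map2() builds a list column of integer sequences in a single vectorized call and unnest() flattens it, avoiding the per-row overhead of rowwise() and do(). A minimal illustration of the idea (purrr is attached via library(tidyverse)):

map2(1L, c(3L, 2L), seq, by = 1)
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 1 2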
www