add duplicate rows to R dataframe based on sequence

Question

My dataframe needs to be expanded

df1<-structure(list(TotalTime = c(0, 15, 16, 23, 24, 29), PhaseName = structure(c(1L,1L, 2L, 2L, 2L, 3L), .Label = c("A", "B","C"), class = "factor")), .Names = c("TotalTime", "Phase"), row.names = c(NA,  6L), class = "data.frame")

df1:

  TotalTime Phase
1         0     A
2        15     A
3        16     B
4        23     B
5        24     B
6        29     C

So that it becomes the following dataframe with rows that are duplicated based on TotalTime, however TotalTime should be filled in for every number (second). (I put ... in the example to reduce space, but should be filled with 6,7,8,9-15 etc.) :

    TotalTime   Phase
1   0   A
2   1   A
3   2   A
4   3   A
5   4   A
6   5   A
    ..  
16  15  A
17  16  B
18  17  B
    ..  B
24  23  B
25  24  B
26  25  B
27  26  B
28  27  B
29  28  B
30  29  C

score 2 · Accepted Answer · edited May 23 '17 at 10:33

2

using both packages zoo and dplyr:

library(dplyr)
library(zoo)
data.frame(TotalTime=0:max(df1$TotalTime)) %>% left_join(df1) %>% na.locf

It first creates a data.frame that has the hole sequence from 0 to 29 (here) and merges it with your data. Then I simply do a "last observation carried forward" imputation on the missing values created by the merge.

It can also be done with the library data.table like this: (see also this answer that I adapted:

library(data.table)
df1 = data.table(df1, key="TotalTime")
df2=data.table(TotalTime=0:max(df1$TotalTime))
df1[df2, roll=T]

edited May 23 '17 at 10:33

Community

1
1

answered Nov 25 '16 at 14:15

agenis

8,069
5
53
102

1

The first solution could also be done without dplyr by using `merge` like this:`na.locf(merge(data.frame(TotalTime=0:max(df1$TotalTime)), df1, all.x = TRUE))` – G. Grothendieck Nov 26 '16 at 14:31

Abdou · Answer 2 · 2016-11-25T14:37:07.253

You can get it done with dplyr with tidyr:

library(tidyverse)

df1 %>% do(data.frame(TotalTime = first(.$TotalTime):last(.$TotalTime))) %>%
    left_join(df1, by="TotalTime") %>%
    fill(Phase)

Output:

 TotalTime Phase
  0        A    
  1        A    
  2        A    
  3        A    
  4        A    
  5        A    
  6        A    
  7        A    
  8        A    
  9        A    
 10        A    
 11        A    
 12        A    
 13        A    
 14        A    
 15        A    
 16        B    
 17        B    
 18        B    
 19        B    
 20        B    
 21        B    
 22        B    
 23        B    
 24        B    
 25        B    
 26        B    
 27        B    
 28        B    
 29        C

I hope this helps.

score 0 · Answer 3 · answered Nov 25 '16 at 14:40

In case you want to see a base R solution.

phases <- with(aggregate(TotalTime~Phase, df1, FUN=min),
               rep(Phase, c(diff(TotalTime),
                            max(df1$TotalTime[df1$Phase == tail(Phase, 1)]) -
                            min(df1$TotalTime[df1$Phase == tail(Phase, 1)])+1)))

The main "trick" here is in that the second argument of rep can be a vector, which then repeats each element of the first argument that many times. The second argument is constructed using the difference of the minimum values of each phase diff(TotalTime) and concatenating the difference of the min and max value (+1) of the final phase level (here, "C"). The minimum values are found with aggregate, and I use with to simplify notation.

The result can then be fed to the data.frame.

data.frame(period=seq_len(length(phases))-1, phase=phases)
   period phase
1       0     A
2       1     A
3       2     A
4       3     A
5       4     A
6       5     A
7       6     A
8       7     A
9       8     A
10      9     A
11     10     A
12     11     A
13     12     A
14     13     A
15     14     A
16     15     A
17     16     B
18     17     B
19     18     B
20     19     B
21     20     B
22     21     B
23     22     B
24     23     B
25     24     B
26     25     B
27     26     B
28     27     B
29     28     B
30     29     C

add duplicate rows to R dataframe based on sequence

3 Answers3

Output: