Add rows based on numerical sequence

Question

I am trying to add rows to a dataframe based on missing values in a numerical sequence.

Here's a reprex. I want to go from this:

> df[-c(1,3,9),]
   id year          V1          V2         V3
2   1 2019  0.84788413  0.10418523  0.2249371
4   2 2018  0.73183889  0.66380165  0.7681833
5   2 2019  0.38263072 -0.66741116 -0.1803099
6   2 2020 -0.05915745  2.09814096  0.8558323
7   3 2018  1.42148474 -1.65590355 -0.0879526
8   3 2019  1.46178632  1.96796970 -0.3489630
10  4 2018  0.12511779 -0.91978526 -2.3880951
11  4 2019  0.93936831 -0.24440871  0.3249178
12  4 2020 -1.57864369 -0.05853787  0.6078194

To this:

   id year          V1          V2         V3
1   1 2018          NA          NA         NA
2   1 2019  0.84788413  0.10418523  0.2249371
3   1 2020          NA          NA         NA
4   2 2018  0.73183889  0.66380165  0.7681833
5   2 2019  0.38263072 -0.66741116 -0.1803099
6   2 2020 -0.05915745  2.09814096  0.8558323
7   3 2018  1.42148474 -1.65590355 -0.0879526
8   3 2019  1.46178632  1.96796970 -0.3489630
9   3 2020          NA          NA         NA
10  4 2018  0.12511779 -0.91978526 -2.3880951
11  4 2019  0.93936831 -0.24440871  0.3249178
12  4 2020 -1.57864369 -0.05853787  0.6078194

The logic is to add the missing year rows and add NAs to the rest of the columns.

Data:

structure(list(id = c(1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), year = c(2019L, 
2018L, 2019L, 2020L, 2018L, 2019L, 2018L, 2019L, 2020L), V1 = c(0.847884128902485, 
0.731838887436047, 0.382630718058478, -0.0591574520333011, 1.42148473746568, 
1.46178631522088, 0.125117791300285, 0.939368308197552, -1.57864368576782
), V2 = c(0.104185228129027, 0.663801650973095, -0.667411160654917, 
2.09814095835567, -1.65590354896798, 1.96796970263568, -0.919785264321656, 
-0.244408708889214, -0.0585378742959754), V3 = c(0.224937129454626, 
0.7681832776488, -0.180309905647701, 0.855832252932298, -0.0879525996394009, 
-0.34896299605019, -2.38809514212219, 0.324917787941616, 0.607819444746004
)), row.names = c(2L, 4L, 5L, 6L, 7L, 8L, 10L, 11L, 12L), class = "data.frame")

score 1 · Accepted Answer · answered Jan 06 '21 at 08:26

You can use tidyr's complete

tidyr::complete(df, id, year)

#      id  year      V1      V2      V3
#   <int> <int>   <dbl>   <dbl>   <dbl>
# 1     1  2018 NA      NA      NA     
# 2     1  2019  0.848   0.104   0.225 
# 3     1  2020 NA      NA      NA     
# 4     2  2018  0.732   0.664   0.768 
# 5     2  2019  0.383  -0.667  -0.180 
# 6     2  2020 -0.0592  2.10    0.856 
# 7     3  2018  1.42   -1.66   -0.0880
# 8     3  2019  1.46    1.97   -0.349 
# 9     3  2020 NA      NA      NA     
#10     4  2018  0.125  -0.920  -2.39  
#11     4  2019  0.939  -0.244   0.325 
#12     4  2020 -1.58   -0.0585  0.608

score 0 · Answer 2 · answered Jan 06 '21 at 08:56

In base R you could use expand.grid() + merge():

merge(df, expand.grid(id = unique(df$id), year = unique(df$year)), all = TRUE)
#    id year          V1          V2         V3
# 1   1 2018          NA          NA         NA
# 2   1 2019  0.84788413  0.10418523  0.2249371
# 3   1 2020          NA          NA         NA
# 4   2 2018  0.73183889  0.66380165  0.7681833
# 5   2 2019  0.38263072 -0.66741116 -0.1803099
# 6   2 2020 -0.05915745  2.09814096  0.8558323
# 7   3 2018  1.42148474 -1.65590355 -0.0879526
# 8   3 2019  1.46178632  1.96796970 -0.3489630
# 9   3 2020          NA          NA         NA
# 10  4 2018  0.12511779 -0.91978526 -2.3880951
# 11  4 2019  0.93936831 -0.24440871  0.3249178
# 12  4 2020 -1.57864369 -0.05853787  0.6078194

Add rows based on numerical sequence

2 Answers2