Construct a loop to compare two dataframes and fill the missing values

Question

if i have this data

C=
year month  day  hour minute    rain
2010    01   01    00     00   0.000
2011    01   01    00     00   0.000
2012    01   01    00     00   0.000
2013    01   01    00     00   0.000
2014    01   01    00     00   0.000
2015    01   01    00     15   0.000

and reference data like:

R=
year month  day  hour minute    rain
2013    01   01    00     00   0.000
2013    01   01    00     05   0.000
2013    01   01    00     10   0.000
2013    01   01    00     15   0.000
2013    01   01    00     20   0.000
2014    01   01    00     00   0.000
2014    01   01    00     05   0.000
2014    01   01    00     10   0.000
2014    01   01    00     15   0.000
2014    01   01    00     20   0.000
2015    01   01    00     00   0.000
2015    01   01    00     05   0.000
2015    01   01    00     10   0.000
2015    01   01    00     15   0.000
2015    01   01    00     20   0.000

I need to complete this M

M=
year month  day  hour minute    rain
2013    01   01    00     00   0.000
2013    01   01    00     05   0.000
2013    01   01    00     10   0.000
2013    01   01    00     15   0.000
2013    01   01    00     20   0.000
2014    01   01    00     00   0.000
2014    01   01    00     05   0.000
2014    01   01    00     10   0.000
2014    01   01    00     15   0.000
2014    01   01    00     20   0.000
2015    01   01    00     15   0.000
2015    01   01    00     20   0.000

as you see, 2015 in M starts at "2015 01 01 00 15 0.000" and we could use C info to make a nrow(c) loop and look for starting dates, the idea is complete this dataframe using a loop and match to fill year, month, day, hour and minute from reference data R, and fill empty rain column with "NaN". the final output would be:

F=
year month  day  hour minute    rain
2013    01   01    00     00   0.000
2013    01   01    00     05   0.000
2013    01   01    00     10   0.000
2013    01   01    00     15   0.000
2013    01   01    00     20   0.000
2014    01   01    00     00   0.000
2014    01   01    00     05   0.000
2014    01   01    00     10   0.000
2014    01   01    00     15   0.000
2014    01   01    00     20   0.000
2015    01   01    00     00   NaN
2015    01   01    00     05   NaN  
2015    01   01    00     10   NaN
2015    01   01    00     15   0.000
2015    01   01    00     20   0.000

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

To fill in the missing rows in M from the reference data R can be implemented as a right join using data.table for instance. So, no loops are required.

library(data.table)
setDT(M)[setDT(R)[, -"rain"], on = .(year, month, day, hour, minute)]

    year month day hour minute rain
 1: 2013     1   1    0      0    0
 2: 2013     1   1    0      5    0
 3: 2013     1   1    0     10    0
 4: 2013     1   1    0     15    0
 5: 2013     1   1    0     20    0
 6: 2014     1   1    0      0    0
 7: 2014     1   1    0      5    0
 8: 2014     1   1    0     10    0
 9: 2014     1   1    0     15    0
10: 2014     1   1    0     20    0
11: 2015     1   1    0      0   NA
12: 2015     1   1    0      5   NA
13: 2015     1   1    0     10   NA
14: 2015     1   1    0     15    0
15: 2015     1   1    0     20    0

Edit: Controlling the years

The OP has asked here and here how to control the years. As the code above implies a right join, all rows of R appear in the result set. So, R needs to filtered appropriately. This can be done by explicitely specifying one year

setDT(M)[setDT(R)[year == 2014L, -"rain"], on = .(year, month, day, hour, minute)]

   year month day hour minute rain
1: 2014     1   1    0      0    0
2: 2014     1   1    0      5    0
3: 2014     1   1    0     10    0
4: 2014     1   1    0     15    0
5: 2014     1   1    0     20    0

or a range of years

setDT(M)[setDT(R)[year %in% 2014:2018, -"rain"], on = .(year, month, day, hour, minute)]

    year month day hour minute rain
 1: 2014     1   1    0      0    0
 2: 2014     1   1    0      5    0
 3: 2014     1   1    0     10    0
 4: 2014     1   1    0     15    0
 5: 2014     1   1    0     20    0
 6: 2015     1   1    0      0   NA
 7: 2015     1   1    0      5   NA
 8: 2015     1   1    0     10   NA
 9: 2015     1   1    0     15    0
10: 2015     1   1    0     20    0

or by looking up the years contained in M

M[, unique(year)]

[1] 2013 2014 2015

setDT(M)[setDT(R)[year %in% M[, unique(year)], -"rain"], 
         on = .(year, month, day, hour, minute)]

    year month day hour minute rain
 1: 2013     1   1    0      0    0
 2: 2013     1   1    0      5    0
 3: 2013     1   1    0     10    0
 4: 2013     1   1    0     15    0
 5: 2013     1   1    0     20    0
 6: 2014     1   1    0      0    0
 7: 2014     1   1    0      5    0
 8: 2014     1   1    0     10    0
 9: 2014     1   1    0     15    0
10: 2014     1   1    0     20    0
11: 2015     1   1    0      0   NA
12: 2015     1   1    0      5   NA
13: 2015     1   1    0     10   NA
14: 2015     1   1    0     15    0
15: 2015     1   1    0     20    0

Data

R <- structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L
), month = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), day = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), minute = c(0L, 5L, 10L, 15L, 20L, 
0L, 5L, 10L, 15L, 20L, 0L, 5L, 10L, 15L, 20L), rain = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("year", "month", 
"day", "hour", "minute", "rain"), row.names = c(NA, -15L), class = "data.frame")

M <- structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2015L, 2015L), month = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), day = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), hour = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), minute = c(0L, 5L, 10L, 15L, 
20L, 0L, 5L, 10L, 15L, 20L, 15L, 20L), rain = c(0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0)), .Names = c("year", "month", "day", "hour", 
"minute", "rain"), row.names = c(NA, -12L), class = "data.frame")

Thanks! I have a question, if reference data have more rows than the data I want to complete.. is it possible to modify the code in order to replace only from the year of the data we want to complete.. not the first year of reference data. This in order to use one reference data for several data to complete. thanks in advance! — Felipe Rincón, Jan 18 '18 at 03:25
We are using a *right join* which means that the result contains *all* rows of `R`. So, we need to filter `R` appropriately. This can be done manually, e.g., `...[setDT(R)[year == 2014L, -"rain"], ...` or by looking up the year(s) existing in `M`. — Uwe, Jan 18 '18 at 04:03
Thanks Uwe, how could I set this: [setDT(R)[year == 2014L, -"rain"] to be from 2014 until 2018? — Felipe Rincón, Jan 18 '18 at 04:41

Construct a loop to compare two dataframes and fill the missing values

1 Answers1

Edit: Controlling the years

Data