I'm interested in expanding a data frame with missing values across any number of columns for the periods where data is missing following the data units.
Example
The problem can be easily illustrated on with use of a simple example.
Data
The generated data contains some time series observations and dates missing on random.
# Data generation
# Seed
set.seed(1)
# Size
sizeDf <- 10
# Populate data frame
dta <- data.frame(
dates = seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
),
varA = runif(n = sizeDf),
varB = runif(n = sizeDf),
varC = runif(n = sizeDf)
)
# Delete rows
dta <-
dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]
Preview
>> dta
dates varA varB varC
1 2016-07-28 0.26550866 0.2059746 0.93470523
2 2016-07-29 0.37212390 0.1765568 0.21214252
3 2016-07-30 0.57285336 0.6870228 0.65167377
4 2016-07-31 0.90820779 0.3841037 0.12555510
7 2016-08-03 0.94467527 0.7176185 0.01339033
8 2016-08-04 0.66079779 0.9919061 0.38238796
9 2016-08-05 0.62911404 0.3800352 0.86969085
10 2016-08-06 0.06178627 0.7774452 0.34034900
Key characteristics
From the perspective of the proposed analysis, the key characteristics are:
- The date units, days in that case
- Randomly missing dates
Missing dates
seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
)[!(seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
) %in% dta$dates)]
"2016-08-01" "2016-08-02"
Desired results
The newly created data frame should look like that:
>> dtaNew
dates varA varB varC
1 2016-07-28 0.3337749 0.32535215 0.8762692
2 2016-07-29 0.4763512 0.75708715 0.7789147
3 2016-07-30 0.8921983 0.20269226 0.7973088
4 2016-07-31 0.8643395 0.71112122 0.4552745
5 2016-08-01 NA NA NA
6 2016-08-02 NA NA NA
7 2016-08-03 0.9606180 0.14330438 0.6049333
8 2016-08-04 0.4346595 0.23962942 0.6547239
9 2016-08-05 0.7125147 0.05893438 0.3531973
10 2016-08-06 0.3999944 0.64228826 0.2702601
This simply obtained with use of:
dtaNew[dtaNew$dates %in% missDates, 2:4] <- NA
where the missDates
is taken from the previous seq
.
Attempts
Creating vector with all the dates is simple:
allDates <- seq(from = min(dta$dates), to = max(dta$dates), by = 1)
but obviously I cannot just push it to the data frame:
>> dta$allDates <- allDates
Error in `$<-.data.frame`(`*tmp*`, "allDates", value = c(17010, 17011, :
replacement has 10 rows, data has 8
The possible solution could use the loop that would push the row with NA
values to the data frame row by row for each of the dates identified as missing but this is grossly inefficient and messy.
To sum up, I'm interested in achieving the following:
- Expanding the data frame with all the dates following the same unit. I.e. for missing daily data days are added, for missing quarterly data quarters are added.
- I would like to then push the NA values across all the columns in the data frame for where the missing date was found