0

I'm interested in expanding a data frame with missing values across any number of columns for the periods where data is missing following the data units.

Example

The problem can be easily illustrated on with use of a simple example.

Data

The generated data contains some time series observations and dates missing on random.

# Data generation

# Seed
set.seed(1)

# Size
sizeDf <- 10

# Populate data frame
dta <- data.frame(
    dates = seq(
        from = Sys.Date() - (sizeDf - 1),
        to = Sys.Date(),
        by = 1
    ),
    varA = runif(n = sizeDf),
    varB = runif(n = sizeDf),
    varC = runif(n = sizeDf)
)

# Delete rows
dta <-
    dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]

Preview

>> dta
        dates       varA      varB       varC
1  2016-07-28 0.26550866 0.2059746 0.93470523
2  2016-07-29 0.37212390 0.1765568 0.21214252
3  2016-07-30 0.57285336 0.6870228 0.65167377
4  2016-07-31 0.90820779 0.3841037 0.12555510
7  2016-08-03 0.94467527 0.7176185 0.01339033
8  2016-08-04 0.66079779 0.9919061 0.38238796
9  2016-08-05 0.62911404 0.3800352 0.86969085
10 2016-08-06 0.06178627 0.7774452 0.34034900

Key characteristics

From the perspective of the proposed analysis, the key characteristics are:

  • The date units, days in that case
  • Randomly missing dates

Missing dates

seq(
    from = Sys.Date() - (sizeDf - 1),
    to = Sys.Date(),
    by = 1
)[!(seq(
    from = Sys.Date() - (sizeDf - 1),
    to = Sys.Date(),
    by = 1
) %in% dta$dates)]

"2016-08-01" "2016-08-02"

Desired results

The newly created data frame should look like that:

>> dtaNew
        dates      varA       varB      varC
1  2016-07-28 0.3337749 0.32535215 0.8762692
2  2016-07-29 0.4763512 0.75708715 0.7789147
3  2016-07-30 0.8921983 0.20269226 0.7973088
4  2016-07-31 0.8643395 0.71112122 0.4552745
5  2016-08-01        NA         NA        NA
6  2016-08-02        NA         NA        NA
7  2016-08-03 0.9606180 0.14330438 0.6049333
8  2016-08-04 0.4346595 0.23962942 0.6547239
9  2016-08-05 0.7125147 0.05893438 0.3531973
10 2016-08-06 0.3999944 0.64228826 0.2702601

This simply obtained with use of:

dtaNew[dtaNew$dates %in% missDates, 2:4] <- NA

where the missDates is taken from the previous seq.


Attempts

Creating vector with all the dates is simple:

allDates <- seq(from = min(dta$dates), to = max(dta$dates), by = 1)

but obviously I cannot just push it to the data frame:

>> dta$allDates  <- allDates
Error in `$<-.data.frame`(`*tmp*`, "allDates", value = c(17010, 17011,  : 
  replacement has 10 rows, data has 8

The possible solution could use the loop that would push the row with NA values to the data frame row by row for each of the dates identified as missing but this is grossly inefficient and messy.


To sum up, I'm interested in achieving the following:

  1. Expanding the data frame with all the dates following the same unit. I.e. for missing daily data days are added, for missing quarterly data quarters are added.
  2. I would like to then push the NA values across all the columns in the data frame for where the missing date was found
Pj_
  • 824
  • 6
  • 15
Konrad
  • 17,740
  • 16
  • 106
  • 167
  • If there are no repeated dates you can use `merge`. – Ernest A Aug 06 '16 at 09:56
  • You are trying to column bind two dataframes with variable row numbers, which is why it fails. Try the solutions mentioned here http://stackoverflow.com/questions/19074163/cbind-is-there-a-way-to-have-missing-values-set-to-na – Frash Aug 06 '16 at 09:58
  • Also suggest you clear the verbose description, took me a long time to understand what you were speaking about. – Frash Aug 06 '16 at 09:59

1 Answers1

0

If I understand your question, you can use rbind.fill from the plyr package to get your desired output:

sizeDf <- 10
# Populate data frame
dta <- data.frame(
  dates = seq(
    from = Sys.Date() - (sizeDf - 1),
    to = Sys.Date(),
    by = 1
  ),
  varA = runif(n = sizeDf),
  varB = runif(n = sizeDf),
  varC = runif(n = sizeDf)
)

# Delete rows
dta <-dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]

#Get missing dates
missing_dates <- seq(from=min(dta$dates), to=max(dta$dates), by=1)[!(seq(from=min(dta$dates), to=max(dta$dates), by=1) %in% dta$dates)]

#Create the new dataset by using plyr's rbind.fill function
dta_new <- plyr::rbind.fill(dta,data.frame(dates=missing_dates))

#Order the data by the dates column
dta_new <- dta_new[order(dta_new$dates),]

#Print it
print(dta_new, row.names = F, right = F)


 dates      varA        varB      varC      
 2016-07-28 0.837859418 0.2966637 0.61245244
 2016-07-29 0.144884547 0.9284294 0.11033990
 2016-07-30          NA        NA         NA
 2016-07-31          NA        NA         NA
 2016-08-01 0.003167049 0.9096805 0.29239470
 2016-08-02 0.574859760 0.1466993 0.69541969
 2016-08-03          NA        NA         NA
 2016-08-04 0.748639215 0.9602836 0.67681826
 2016-08-05 0.983939562 0.4867804 0.35270309
 2016-08-06 0.383366957 0.2241982 0.09244522

I hope this helps.

Abdou
  • 12,931
  • 4
  • 39
  • 42