1

I am working on solving another problem here: Fill in time series gaps with both LCOF and NOCB methods but acknowledge breaks in time series and it raised a new issue.

I have a time series dataset with missing data (NA).

    > str(final_daily_intake2)
    'data.frame':   387 obs. of  11 variables:
     $ Date             : chr  "2014-08-13" "2014-08-14" "2014-08-15" "2014-08-16" ...
     $ MEID.1           : chr  NA NA NA "14" ...
     $ MEID.2           : Factor w/ 184 levels "1","100","100.1",..: NA NA NA 143 48 NA NA NA NA NA ...
     $ MEID.3           : Factor w/ 180 levels "100","100.1",..: NA NA NA 24 134 NA NA NA NA NA ...
     $ MEID.4           : Factor w/ 42 levels "173","173a","173b",..: NA NA NA 17 1 NA NA NA NA NA ...
     $ MEID.5           : Factor w/ 3 levels "d1","s1","s2": NA NA NA 2 3 NA NA NA NA NA ...
     $ MEID.6           : Factor w/ 1 level "s2": NA NA NA NA NA NA NA NA NA NA ...
     $ DAYT             : int  NA NA NA 1 8 NA NA NA NA NA ...
     $ DATT             : int  NA NA NA 1 1 NA NA NA NA NA ...
     $ Reason.For.Change: chr  "0" "0" "0" "0" ...
     $ GAP_Days         : chr  "1" "2" "3" "NA" ...

And head() on the data gives

       Date MEID.1 MEID.2 MEID.3 MEID.4 MEID.5 MEID.6 DAYT DATT Reason.For.Change GAP_Days
1 2014-08-13   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        1
2 2014-08-14   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        2
3 2014-08-15   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        3
4 2014-08-16     14     61    117    187     s1   <NA>    1    1                 0       NA
5 2014-08-17    14a    193     56    173     s2   <NA>    8    1                 0       NA
6 2014-08-18   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        1

I have labeled consecutive days in the missing data time periods in column GAP_Days.

I would like to create a new column in the dataframe that labels the missing data time period. For example, this dataset has 9 missing time periods. The new column GAP_LABEL would simply start at 1 and end at 9. For example, in the first missing time period in this data set, the GAP_LABEL value would be 1 for the first three dates, since they all belong to the same missing data time period.

Anticipated outcome

    Date MEID.1 MEID.2 MEID.3 MEID.4 MEID.5 MEID.6 DAYT DATT Reason.For.Change GAP_Days GAP_LABEL
1 2014-08-13   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        1          1
2 2014-08-14   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        2          1
3 2014-08-15   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        3          1
4 2014-08-16     14     61    117    187     s1   <NA>    1    1                 0       NA
5 2014-08-17    14a    193     56    173     s2   <NA>    8    1                 0       NA
6 2014-08-18   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   NA   NA                 0        1          2

I have not seen other examples that demonstrate how this would be done. If anyone has suggestions, that would be great.

Community
  • 1
  • 1
user3795577
  • 187
  • 1
  • 18

1 Answers1

0

How about something like this:

final_daily_intake2 <- within(final_daily_intake2, {
  GAP_LABEL <- GAP_Days
  GAP_LABEL[!is.na(GAP_Days)] <- cumsum(GAP_Days[!is.na(GAP_Days)]==1)
})

The key here is that cumsum() gives the cumulative sum of the vector. By using cumsum() on GAP_Days==1 it allows you to increment by 1 every time you start a new count.

Sam Dickson
  • 5,082
  • 1
  • 27
  • 45
  • My problem is that GAP_LABEL does not exist and I want to create it. (see edit above to include my anticipated outcome). – user3795577 Nov 19 '15 at 19:35
  • That was careless. I had meant to use `GAP_Days` to make the calculation. That has been updated. – Sam Dickson Nov 19 '15 at 19:38
  • Thank you. This is very close. This appropriately labels the GAP_Days within the same time period as the same value, but it also labels the subsequent NA days, which are non-missing and therefore not part of the GAP. For example, the first two non-NA rows in the dataset also get a GAP_LABEL of 1. I will go ahead and accept this answer because it got me to a place where I can just create a conditional column that accepts values from GAP_LABEL when data are NA in other columns. Thank you so much for your help. – user3795577 Nov 19 '15 at 19:47
  • Okay. Using `!is.na(GAP_Days)` should make it so that it leaves the non-NA values alone (they should end up being `NA`in `GAP_LABEL`). I wonder what went wrong. – Sam Dickson Nov 19 '15 at 19:52