2

I am struggling with a particular data manipulation issue. I posted recently on this, and was suggested the na.locf function in the zoo library, but this is a homework question and the TA specified not to load other libraries to solve this, so I need to use a more traditional data manipulation approach. We are given the following dataframe, which I show the first 10 rows of:

my_df[1:10,1:2]

     DATE FLIGHT
1  May 26     NA
2  Jun 10     NA
3  Jun 21     NA
4              1
5              2
6              3
7  Jun 23     NA
8              1
9              2
10 Jun 25     NA

I need to drag values down in the DATE column where there are blanks (rows 4-6 all Jun 21, rows 8-9 both Jun 23, etc.). I understand how na.locf makes this a one liner, a shame I cannot use it.

My idea is as follows:

  1. grab the dates that need to be dragged down using:

    repeat_dates = my_df$DATE[which(my_df$FLIGHT == 1)-1]
    
  2. create a vector with the number of times to repeat each date. here 3 corresponds to having to drag Jun 21 down 3 rows, 2 for dragging Jun 23 down 2 rows, etc.

    repeat_count = [3, 2, ...]
    
  3. use rep.int(repeat_dates, repeat_count) to get the vector:

    ["Jun 21", "Jun 21", "Jun 21", "Jun 23", "Jun 23", ...]
    
  4. and then update the values

I am not sure how to do part 2, if I can get this then I can get the problem. My TA suggested using the cumsum() function, possibly on a vector of booleans (1s / 0s where the DATE is blank?). I'm struggling with this, any help is appreciated!

EDIT - reproducable to help

dput(my_df[1:20,1:2])
structure(list(DATE = structure(c(-23961, -23946, -23935, NA, 
NA, NA, -23933, NA, NA, -23931, -23911, -23893, NA, NA, -23891, 
NA, NA, -23890, NA, NA), class = "Date"), FLIGHT = c(NA, NA, 
NA, 1L, 2L, 3L, NA, 1L, 2L, NA, NA, NA, 1L, 2L, NA, 1L, 2L, NA, 
1L, 2L)), .Names = c("DATE", "FLIGHT"), row.names = c(NA, 20L
), class = "data.frame") 
Canovice
  • 9,012
  • 22
  • 93
  • 211
  • You can just you use a for loop and an if statement when the date is blank to replace it with the previous value. – tatxif Feb 06 '17 at 20:52
  • 1
    Have a look at http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value – HubertL Feb 06 '17 at 20:53
  • 1
    Possible duplicate of [Replacing NAs with latest non-NA value](http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value) – HubertL Feb 06 '17 at 20:54
  • thanks i went with one of the solutions in here, appreciate it – Canovice Feb 06 '17 at 22:02
  • Even if you can't use `na.locf` itself, presumably it's ok to look at the source code for it. – G. Grothendieck Feb 07 '17 at 00:05

2 Answers2

1
for(i in 1:length(my_df$DATE)){
    if(my_df$DATE[i]==""){
    my_df$DATE[i] <- my_df$DATE[i-1]
    }
}
tatxif
  • 438
  • 6
  • 10
0

I went with a solution in the link shared:

replace_na_with_last<-function(x,a=!is.na(x)){
    x[which(a)[c(1,1:sum(a))][cumsum(a)+1]]
}

this i believe is the approach alluded to by the TA in my class, and works sufficiently, despite an initial confusion to understand the indexing going on.

Canovice
  • 9,012
  • 22
  • 93
  • 211