0

I would like to impute missing values using the last observation carried forward(locf) or the next observation carried backward(nocb) in two or more gaps.

In order to determine the direction (top/down) to fill the missing values, the first column (Interval) in the data frame is weighted. Interpolation (locf/nocb) should be from the smallest-value end(Interval values that correspond to NAs rows outward). The default fillna is locf.

Example: logic for exceptional nafill : in ID2 column, row 5-7 (header included), the direction will be nocb because (Interval[7,] value= 50) which is less than Interval[5,] value=100.

An attempt using df1 <-df %>%fill(ID1, ID2, ID3,which.min(Interval)) %>% #default direction down fill(ID1, ID2, ID3, .direction = "up",which.min(Interval)) aint working, please assist to correct it or suggest another efficient R solution. Thank you in advance for your help.

Input

df = data.frame(
             Interval = c(0,20,80,100,50,50,130,100,70,80,200), 
             ID1 = c(0,1,1,0,NA,NA,NA,NA,1,NA,1), 
             ID2 = c(1,0,0,NA,NA,NA,1,1,3,NA,1), 
             ID3 = c(1,NA,1,0,3,NA,NA,NA,1,NA,1) 
)

Interval    ID1 ID2 ID3
0   0   1   1
20  1   0   NA
80  1   0   1
100 0   NA  0
50  NA  NA  3
50  NA  NA  NA
130 NA  1   NA
100 NA  1   NA
70  1   3   1
80  NA  NA  NA
200 1   1   1

Expected output

Interval    ID1 ID2 ID3
0   0   1   1
20  1   0   NA
80  1   0   1
100 0   1   0
50  0   1   3
50  0   1   3
130 0   1   3
100 0   1   3
70  1   3   1
80  NA  NA  NA
200 1   1   1

Glen Viet
  • 21
  • 6
  • I am a little unclear about how exactly you wish to use the Interval column to determine the direction. Would it be possible to expand on the second paragraph a little? – hammoire Apr 13 '20 at 06:51
  • @Glen I guess you are correct upto some extent. Check with this df %>% fill(ID1, ID2, ID3, .direction = "downup") . Also, you can check the following [link](https://stackoverflow.com/questions/23340150/replace-missing-values-na-with-most-recent-non-na-by-group) – nikn8 Apr 13 '20 at 06:55
  • The interval column represents a physical location, so I want to impute to a closer location! The interval is not time series per se. The reason for my initial attempt to add the parameter - which.min(Interval). The NAs in ID2 column, row 5-7 should be filled nocb for example.@hammoire, hope my explanation clearer now, thanks. @Neel Kamal, thank you for the link and correction, not solution however. – Glen Viet Apr 13 '20 at 11:15
  • Short question out of curiosity: What is the reasoning behind filling only gaps larger than 1 NA in a row? The R package imputeTS has a locf function with a maxgap option. Which then only fill gaps smaller then a certain threshold of consecutive NAs. The reasoning there is, that the larger the gap the more inaccurate locf will be. But what is the reasoning of excluding imputation of smaller gaps? – Steffen Moritz Apr 14 '20 at 03:47
  • Thanks for your observations@stats0007. To give a background, I wanna impute markers to the values of the closer skeleton markers based on physical location (Interval). In the df, for example, ID2 column, Interval values 50 and 100 are both corresponding to NA entries, thus the smallest in this case 50 will give fillna direction, i.e nocb. I will later handle single NA entries using default locf. – Glen Viet Apr 14 '20 at 04:44

1 Answers1

0

It is not clear what your first column is for. Your second expected column is a mix of locf and nocb, however you can use standard nafill column by column:

library(data.table)
apply(df, 2, function(j) {nafill(j, type = "locf")})

According to your comment, if you like to make nafilldependent on df$Intervalyou can index your df using:

StartIndex <- apply(df[,2:dim(df)[2]], 2, function(j) {
  which(diff(is.na(j)) == 1 ) + 1})
EndIndex <- apply(df[,2:dim(df)[2]], 2, function(j) {
  which(diff(is.na(j)) == -1) })

And indicating if start and end of df$Intervalis increasing or decreasing:

lapply(1:(dim(df)[2]-1), function(i) {
  lapply(1:length(StartIndex[[i]]), function(j) {
    range <- c(StartIndex[[i]][j], EndIndex[[i]][j])
    if (range[1] != range[2]) {
      df$Interval[range[1]] - df$Interval[range[2]]
    }
  })
})
Cevior
  • 97
  • 8
  • Thank you @fabian, interval column represents a physical location, so I want to impute to a closer location! The intervals are not time series per se. The NAs in ID2 column, row 5-8 should be filled nocb for example as you have pointed out. – Glen Viet Apr 13 '20 at 09:38
  • You mean ID1 rows 5-8 (since ID2 has NAs from 4 to 6 and 10)? So what's the logic behind the nafill wanted? – Cevior Apr 13 '20 at 10:04
  • My bad mixup, sorry@ fabian. I have added the logic behind nafill to the main question, please check!. Basically, the corresponding value in the Interval column determines direction whether locf or nocb. The reason for my initial attempt to add the parameter - which.min(Interval) – Glen Viet Apr 13 '20 at 11:07
  • Ok, however, the values 50 and 100 are both corresponding to NA entries. What's the logic behind a single NA entry then? – Cevior Apr 13 '20 at 13:20
  • I bet you are right @fabian, since I was mainly interested in nafill of big gaps to a closer physical location, maybe ignoring single NA entry will be ideal! Please bear with me my programming reasoning. I will update the main question. – Glen Viet Apr 13 '20 at 14:42
  • Besides, single entries can later be filled using your above data.table locf approach, thanks – Glen Viet Apr 14 '20 at 01:42