I am working with time series data that should have continuous time stamps, but a few time points were missed during capture, as shown below:

DF

ID Time_Stamp             A           B                 C
1  02/02/2018 07:45:00   123          567               434     
2   02/02/2018 07:45:01    
.....                  ...

5   02/02/2018 07:46:00   
6   02/02/2018 07:46:10    112          2323            2323

As the sample df above shows, the time stamps are continuous up to row 5, but 10 seconds of data are missing between rows 5 and 6. My data frame has about 60,000 rows, so identifying the missing values manually is tedious. I am therefore looking for a way to automate the handling of these missing values in R.

My result data frame should be as shown below,

ID Time_Stamp                     A          B          C
1  02/02/2018 07:45:00           123         567        434     
2   02/02/2018 07:45:01    
.....                  ...

5   02/02/2018 07:46:00         123         567         434
5.1  02/02/2018 07:46:01        123         567         434 
5.2  02/02/2018 07:46:02        123         567         434  
5.3  02/02/2018 07:46:03        123         567         434  
5.4  02/02/2018 07:46:04        123         567         434 
5.5  02/02/2018 07:46:05        123         567         434 
5.6  02/02/2018 07:46:06        123         567         434 
5.7  02/02/2018 07:46:07        123         567         434 
5.8  02/02/2018 07:46:08        123         567         434 
5.9  02/02/2018 07:46:09        123         567         434 
6   02/02/2018 07:46:10         112         2323        2323
6.1 02/02/2018 07:46:11         112         2323        2323 

i.e. each missing row should take the values of the last row before the gap. Here those are the values of row 5, and they should be repeated until the next recorded value is available.

I tried the following code, which fills the missing rows with 0, but I could not work out how to fill them with the values of the previous row instead:

df1 <- df %>%
  complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
  mutate_at(vars(A:C), ~replace(., is.na(.), 0)) %>%
  mutate(ID = row_number())

I got output as follows:

ID Time_Stamp                     A        B        C
1  02/02/2018 07:45:00           123      567      434     
2   02/02/2018 07:45:01    
.....                  ...

5   02/02/2018 07:46:00         123       567      434
5.1  02/02/2018 07:46:01        0         0         0 
5.2  02/02/2018 07:46:02        0         0         0  
5.3  02/02/2018 07:46:03        0         0         0  
5.4  02/02/2018 07:46:04        0         0         0 
5.5  02/02/2018 07:46:05        0         0         0 
5.6  02/02/2018 07:46:06        0         0         0 
5.7  02/02/2018 07:46:07        0         0         0 
5.8  02/02/2018 07:46:08        0         0         0 
5.9  02/02/2018 07:46:09        0         0         0 
6   02/02/2018 07:46:10         112       2323      2323
6.1 02/02/2018 07:46:11         0         0         0 

I want the 0 values to be replaced with the values from the previous row.

Thanks in advance

  • Not clear. Can you show what would be your expected output? – Ronak Shah Jul 04 '19 at 09:42
  • @RonakShah, Hi, I want the previous row's values, i.e. `row number 5 for all the missing values between 5 and 6`, and `row number 6 for values missing between 6 and 7`. It's like keeping the previous value for the missing values, i.e. the `i = i-1` value for all the missing rows. –  Jul 04 '19 at 09:51
  • @RonakShah. I mean the `0` values should be replaced with the 5th row's values. –  Jul 04 '19 at 09:54
  • Do you need `df %>% complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>% fill(A, B, C) ` ? – Ronak Shah Jul 04 '19 at 09:55
  • @RonakShah last value of `A,B,C` –  Jul 04 '19 at 09:56
  • After completing the Time_Stamp the problem is similar to https://stackoverflow.com/questions/14655286/replace-missing-value-with-previous-value – Flavia Jul 04 '19 at 09:58
  • @RahulK what do you mean by last value A, B, C? Last non-NA value of `A`, `B` and `C` respectively right? Can you check the output given by `df %>% complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>% fill(A, B, C) ` what does it give? – Ronak Shah Jul 04 '19 at 10:02
  • @RonakShah `Error: df1_p[, 2:11] must evaluate to column positions or names, not a list` I have 11 variables in my original `DF` –  Jul 04 '19 at 10:05
  • Hmmm..not sure. Try this `df %>% complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>% fill(everything()) ` – Ronak Shah Jul 04 '19 at 10:10
  • @RonakShah This code worked for me perfectly. Thank You –  Jul 04 '19 at 10:33

2 Answers

Using the xts Package
The xts package has some functions for dealing with missing values. What you want is called LOCF (short for Last Observation Carried Forward). The xts package provides an na.locf method for this purpose. Note that you will have to convert your data to an xts object beforehand.

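library(xts)
# order.by needs a date/time vector, so extract the column with [[ ]] rather than [ ]
my_xts <- xts(df[3:5], order.by = df[[2]])
newData <- na.locf(my_xts, na.rm = TRUE, fromLast = FALSE, maxgap = Inf)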

Note that since an xts object is built on top of a matrix, it expects all columns to be of the same type (i.e. in your case, all columns except the timestamp should be numeric; the timestamp goes into the index of the xts object). Also note that Time_Stamp should be of a date/time type.
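Note that na.locf only fills values that are already NA; it does not create rows for time stamps that are absent. A minimal sketch of one way to handle that, assuming a one-second grid and using made-up toy values (the names obs, grid and filled are hypothetical): merge the series with an empty xts object that carries every second, then carry the last observation forward.

```r
library(xts)

# Toy data (hypothetical values): a 3-second gap between two observations
times <- as.POSIXct(c("2018-02-02 07:46:00", "2018-02-02 07:46:03"), tz = "UTC")
obs <- xts(cbind(A = c(123, 112), B = c(567, 2323)), order.by = times)

# Zero-column xts object holding every second in the observed range
grid <- xts(order.by = seq(start(obs), end(obs), by = "sec"))

# Outer join onto the grid (new rows are NA), then carry the last observation forward
filled <- na.locf(merge(obs, grid))
```

The merge step plays the same role here as complete() does in the tidyr approach.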

More generally
Generally, for dealing with missing values, you can use the na.interp() function from the forecast package. na.interp() replaces missing values with interpolated values. And it can also interpolate seasonal data.
As its documentation puts it:

By default, uses linear interpolation for non-seasonal series. For seasonal series, a robust STL decomposition is first computed. Then a linear interpolation is applied to the seasonally adjusted data, and the seasonal component is added back.

na.interp() can't deal with dataframes directly. You will have to convert your data into a time series object (ts or msts objects).

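library(forecast)
library(magrittr)  # provides %>%
# ts() builds a plain non-seasonal series; msts() would also need a seasonal.periods argument
my_ts <- df$A %>%
 ts() %>%
 na.interp()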

Using tidyr
The fill function from the tidyr package makes this very simple. In that case, there's no need to convert the data to a time series object.
From the tidyr documentation:

Description:
Fills missing values in selected columns using the previous entry. This is useful in the common output format where values are not repeated, they’re recorded each time they change.
Usage:
fill(data, ..., .direction = c("down", "up"))
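As a small illustration of the .direction argument, here is a sketch on a made-up data frame (readings and its values are hypothetical, not taken from the question):

```r
library(tidyr)

# Hypothetical readings; NA marks a value that was not recorded
readings <- data.frame(step = 1:5, A = c(123, NA, NA, 112, NA))

# Default direction "down": last observation carried forward
down <- fill(readings, A)

# Direction "up": next observation carried backward; a trailing NA stays NA
up <- fill(readings, A, .direction = "up")
```

With "down", a leading NA would stay NA because there is no earlier value to copy, which is worth checking if the series starts with a missing row.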

hnagaty

The following code works perfectly

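 library(dplyr)
 library(tidyr)
 # Time_Stamp must be a date/time (POSIXct) column for seq(..., by = "sec") to work
 df1 <- df %>%
      complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
      fill(everything()) %>%
      mutate(ID = row_number())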

It inserts a row for every missing time stamp and fills it with the last values recorded before the gap started.
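For a reproducible check, here is a minimal sketch of the same pipeline on toy data. It assumes Time_Stamp arrives as text in day/month/year order (the question does not say which order is used) and uses made-up values; df_filled is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values): seconds 07:46:01 to 07:46:03 are missing
df <- data.frame(
  ID = 1:3,
  Time_Stamp = c("02/02/2018 07:45:59", "02/02/2018 07:46:00", "02/02/2018 07:46:04"),
  A = c(123, 123, 112),
  B = c(567, 567, 2323),
  C = c(434, 434, 2323)
)

df_filled <- df %>%
  # Assumed format: day/month/year; adjust the format string if the data differ
  mutate(Time_Stamp = as.POSIXct(Time_Stamp, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")) %>%
  # Add one row per missing second; the new rows hold NA in every other column
  complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
  # Carry the last recorded values forward into the new rows
  fill(A, B, C) %>%
  # Renumber the rows so ID is consecutive again
  mutate(ID = row_number())
```

Parsing the time stamps first matters: if the column is still character, min() and max() return strings and seq() cannot build a per-second sequence from them.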