0

I have a set of 10 years (2009-2020), 30-min interval meteorological datasets, but the data has missing values during night (~17:00 to ~08:00 next day) for 2 two years (2015-2017) due to battery failure of the instrument. Variables are: air temperature, humidity, wind, radiations.

  1. Does anyone have any good idea to fill those values? Or
  2. If I want to fill them using mean of each variable for same time step (rest of the years), how to do that? Any idea please?

I tried to do using fill_by_function(fun = mean) of padR, but it uses mean of all datasets and a single value.

Below I have put a slice example of my data frame structure:

chhota_1 <- data %>%
  slice(89845:89900) %>%
  pad() %>%
  fill_by_value(na.pad=TRUE)

Structure of the data, a look

Limey
  • 10,234
  • 2
  • 12
  • 32
  • It would be easier to help if you create a small reproducible example along with expected output. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Mar 01 '21 at 12:29
  • Hi Ronak, Thanks for the comment. Yes, I am new in R as well as in this forum. Yes for a better example I can share a part of the datasets, that would help I think. – Arindan Mandal Mar 03 '21 at 04:57

1 Answers1

0

You could accomplish it via data.table (among many other methods):

library(data.table)

data = data.table(
  "timestep" = rep(1:5, 4),
  "year" = rep(2010:2013, each = 5),
  "value" = rnorm(20)
)

# create NA year
data[year == 2012, value := NA_real_]
> data
    timestep year      value
 1:        1 2010  0.7383885
 2:        2 2010  0.6547628
 3:        3 2010 -0.9825121
 4:        4 2010  0.2670737
 5:        5 2010  0.8688922
 6:        1 2011  0.1509628
 7:        2 2011  1.2482106
 8:        3 2011 -0.9492589
 9:        4 2011  1.2311409
10:        5 2011 -0.1162351
11:        1 2012         NA
12:        2 2012         NA
13:        3 2012         NA
14:        4 2012         NA
15:        5 2012         NA
16:        1 2013 -1.0179958
17:        2 2013  0.4368148
18:        3 2013  0.7547140
19:        4 2013 -0.4759922
20:        5 2013 -0.2393624

fill NA by mean for that timestep (over other years):

data[, value := ifelse(is.na(value), mean(value, na.rm = TRUE), value), by = "timestep"]
> data
    timestep year       value
 1:        1 2010  0.73838849
 2:        2 2010  0.65476283
 3:        3 2010 -0.98251205
 4:        4 2010  0.26707371
 5:        5 2010  0.86889218
 6:        1 2011  0.15096278
 7:        2 2011  1.24821056
 8:        3 2011 -0.94925891
 9:        4 2011  1.23114088
10:        5 2011 -0.11623511
11:        1 2012 -0.04288152
12:        2 2012  0.77992939
13:        3 2012 -0.39235232
14:        4 2012  0.34074078
15:        5 2012  0.17109823
16:        1 2013 -1.01799582
17:        2 2013  0.43681478
18:        3 2013  0.75471400
19:        4 2013 -0.47599225
20:        5 2013 -0.23936237

EDIT:

From the sample data provided and comments:

> library(lubridate)
> library(data.table)
> 
> data = fread("~/Downloads/test_data.csv")
> 
> data[, timestamp := as.POSIXct(fast_strptime(timestamp, format = "%m/%d/%Y %H:%M"))]
> 
> data[, date := format(timestamp, "%m-%d")]
> data[, year := format(timestamp, "%Y")]
> 
> data[is.na(value), ]
                 timestamp value     time  date year
    1: 2012-06-22 11:00:00    NA 11:00:00 06-22 2012
    2: 2015-02-22 18:00:00    NA 18:00:00 02-22 2015
    3: 2015-02-22 18:30:00    NA 18:30:00 02-22 2015
    4: 2015-02-22 19:00:00    NA 19:00:00 02-22 2015
    5: 2015-02-22 19:30:00    NA 19:30:00 02-22 2015
   ---                                              
16194: 2017-04-07 05:30:00    NA  5:30:00 04-07 2017
16195: 2017-04-07 06:30:00    NA  6:30:00 04-07 2017
16196: 2017-04-07 18:30:00    NA 18:30:00 04-07 2017
16197: 2017-04-07 23:00:00    NA 23:00:00 04-07 2017
16198: 2017-04-08 19:00:00    NA 19:00:00 04-08 2017
> 
> data[, value := ifelse(is.na(value), mean(value, na.rm = TRUE), value), by = c("time", "date")]
> data[is.na(value), ]
Empty data.table (0 rows and 5 cols): timestamp,value,time,date,year
Jav
  • 2,203
  • 13
  • 22
  • Hi Jav, Thank you very much for the reply and a possible solution. However, I tried this way. But this gives the mean of time steps for all the days in all years. This makes the output unrealistic as this is meteorological data and that has seasonality. So I was thinking how to do the same way, but the mean will be for that particular day for particular time steps. As I have 10 years data, 2 years missing night values, then I have rest 8 years datasets, so mean of all 8 values to be used for filling the gaps. Can we do that or a simple way? I hope I could clear you my question. – Arindan Mandal Mar 03 '21 at 04:54
  • Hi Jav, a part of the entire datasets in csv is attached: https://easyupload.io/g8q0tv Have a look or in case you would like to give it try. Thank you. – Arindan Mandal Mar 03 '21 at 05:02
  • @ArindanMandal replace `by = "timestep"` with `by = c("date", "timestep")` – Jav Mar 04 '21 at 06:30
  • Hi Jav, this one is great and solved my requirements. Thank you very much for the nice stuff. – Arindan Mandal Mar 08 '21 at 09:23