How to replace NA data of specific dates with the average data of different years of same dates of a dataframe in R?

Question

If I have a dataframe A

A = 
year     month     day     hour     minute     rain
                        .
                        .
                        .
2000      01        01      01       00          2
2000      01        01      01       15          2
2000      01        01      01       30          NA
2000      01        01      01       45          3
2000      01        01      02       00          4
2000      01        01      02       15          5
                        .
                        .
                        .

Dataframe A have data from 1990 to 2000 with a frequency data of 15 minutes.

so A have the same dates and different years of specific missing data (NA):

A = 
year     month     day     hour     minute     rain
1990      01        01      01       30          10
                        .
                        .
                        .
1991      01        01      01       30          21
                        .
                        .
                        .
1992      01        01      01       30          4
                        .
                        .
                        .
1993      01        01      01       30          6
                        .
                        .
                        .
1994      01        01      01       30          10
                        .
                        .
                        .
1995      01        01      01       30          23
                        .
                        .
                        .
1996      01        01      01       30          0
                        .
                        .
                        .
1997      01        01      01       30          0
                        .
                        .
                        .
1998      01        01      01       30          0
                        .
                        .
                        .
1999      01        01      01       30          6
                        .
                        .
                        .
2000      01        01      01       30          NA

The idea is to look for every NA data in Rain column and replace them with the average of the same date of each year from 1990 to 2000.

The new A would be for 2000 year:

A = 
year     month     day     hour     minute     rain
                        .
                        .
                        .
2000      01        01      01       00          2
2000      01        01      01       15          2
2000      01        01      01       30          **8**
2000      01        01      01       45          3
2000      01        01      02       00          4
2000      01        01      02       15          5
                        .
                        .
                        .

It is better if you can provide a reproducible example using `dput`, See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — kangaroo_cliff, Jan 18 '18 at 03:27
It may not be the prettiest way to do it, but you could create a new data frame from B that just has a key (month-day) and an average of rain. Then merge that onto A and just do a ifelse(is.na(rain), average_rain, rain). — Pawel, Jan 18 '18 at 03:28

kangaroo_cliff · Accepted Answer · 2018-01-18T04:11:49.297

2

This might work. It groups the months and days pairs and the replace the NAs from the mean.

library(dplyr)
A <- A %>% 
       group_by(month, day, hour, minute) %>% 
       mutate(rain = ifelse(is.na(rain), 
                              mean(rain, na.rm=TRUE), rain))

edited Jan 18 '18 at 04:11

answered Jan 18 '18 at 03:29

kangaroo_cliff

6,067
3
29
42

headpoint, could you give me an advise about how to solve this? https://stackoverflow.com/questions/48835464/how-to-manage-a-table-to-obtain-information-using-conditions – Felipe Rincón Feb 17 '18 at 14:08
Some tips would be very helpful! thanks for your answer. By the way, I've been trying to use simple conditionals in R, and logical indexes in matlab. What do you think could be the best way to solve it? any idea is gold! – Felipe Rincón Feb 17 '18 at 14:28
For that question, I think I'd first select the rows where a > 0, and then try with questions like `cut` or `findInterval` on the two columns to see if I can get the required output easily. – kangaroo_cliff Feb 17 '18 at 14:40
Does _findInterval_ allows us to obtain the number of times we can find all the entries of a column on ranges that variate from 0-5;0-10;0-15..until 0-35? Thanks for your help – Felipe Rincón Feb 17 '18 at 15:16

www · Answer 2 · 2018-01-18T16:23:23.280

A solution using dplyr. The idea is to summarize the rain in A by month, day, hour, and minute, join by these columns, and then replace NA in rain with the average rain value.

Notice that after the OP updated and clarified the original question, headpoint's answer is more simple and straightforward (https://stackoverflow.com/a/48313380/7669809). However, I still keep my answer here as an example to show that if the replacement values are from another data frame, the left_join approach would be useful.

library(dplyr)

A2 <- A %>%
  left_join(A %>%
              group_by(month, day, hour, minute) %>%
              summarise(mean_rain = mean(rain, na.rm = TRUE)), 
            by = c("month", "day", "hour", "minute")) %>%
  mutate(rain = ifelse(is.na(rain), mean_rain, rain)) %>%
  select(-mean_rain)
A2
#    year month day hour minute rain
# 1  1990     1   1    1     30   10
# 2  1991     1   1    1     30   21
# 3  1992     1   1    1     30    4
# 4  1993     1   1    1     30    6
# 5  1994     1   1    1     30   10
# 6  1995     1   1    1     30   23
# 7  1996     1   1    1     30    0
# 8  1997     1   1    1     30    0
# 9  1998     1   1    1     30    0
# 10 1999     1   1    1     30    6
# 11 2000     1   1    1      0    2
# 12 2000     1   1    1     15    2
# 13 2000     1   1    1     30    8
# 14 2000     1   1    1     45    3
# 15 2000     1   1    2      0    4
# 16 2000     1   1    2     15    5

DATA

A <- read.table(text =  "year     month     day     hour     minute     rain
1990      01        01      01       30          10
1991      01        01      01       30          21
1992      01        01      01       30          4
1993      01        01      01       30          6
1994      01        01      01       30          10
1995      01        01      01       30          23
1996      01        01      01       30          0
1997      01        01      01       30          0
1998      01        01      01       30          0
1999      01        01      01       30          6
2000      01        01      01       00          2
2000      01        01      01       15          2
2000      01        01      01       30          NA
2000      01        01      01       45          3
2000      01        01      02       00          4
2000      01        01      02       15          5",
                header = TRUE, stringsAsFactors = FALSE)

I didn't realize the OP was talking about two data sets since he only mentioned A in text. — kangaroo_cliff, Jan 18 '18 at 03:43
@headpoint I think that is a typo. OP should have mentioned B. — www, Jan 18 '18 at 03:45
@headpoint I have updated my post in accordance with the OP's update. I believe with the OP's update your solution is better than mine as it is simple and straightforward. But I still keep my `left_join` approach here to show how to replace values if the replacement values are from another data frame. — www, Jan 18 '18 at 16:12

Hlynur · Answer 3 · 2018-01-18T05:01:15.347

Here's a dplyr method that uses only the means from the same time from previous years. That is, as opposed to solutions provided by headpoint and www, this will not use values for 01:30, January 1st 2001 or 2002 (etc) to calculate the mean used for 01:30, January 1st 2000, but rather the mean rain value on 01:30, January 1st for all the years before 2000.

So first, we simply recreate the A dataframe for which to work with. I only load the tibble package to use the tribble function to recreate your dataframe example.

library(dplyr)
library(tibble)

A <- tribble(
~ year, ~month, ~day, ~hour, ~minute, ~rain,
1990,   01,     01,   01,    30,      10,
1991,   01,     01,   01,    30,      21,
1992,   01,     01,   01,    30,      4,
1993,   01,     01,   01,    30,      6,
1994,   01,     01,   01,    30,      10,
1995,   01,     01,   01,    30,      23,
1996,   01,     01,   01,    30,      0,
1997,   01,     01,   01,    30,      0,
1998,   01,     01,   01,    30,      0,
2000,   01,     01,   01,    00,      2,
2000,   01,     01,   01,    15,      2,
2000,   01,     01,   01,    30,      NA,
2000,   01,     01,   01,    45,      3,
2000,   01,     01,   02,    00,      4,
2000,   01,     01,   02,    15,      5)

And then what we do is group by month, day, hour and minute so that we can create a running mean for that time of that day in previous years. The most recent of which will then be used if there's an NA value the following year.

A <- A %>% 
  group_by(month, day, hour, minute) %>% 
  mutate(running_mean = round(cummean(rain), 0),
         most_recent_mean = lag(running_mean),
         rain = if_else(is.na(rain), most_recent_mean, rain)) %>%
  select(-running_mean, -most_recent_mean)

And that should do the trick. The printed result is as follows:

# A tibble: 15 x 6
# Groups:   month, day, hour, minute [6]
    year month   day  hour minute  rain
   <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1  1990     1     1     1     30    10
 2  1991     1     1     1     30    21
 3  1992     1     1     1     30     4
 4  1993     1     1     1     30     6
 5  1994     1     1     1     30    10
 6  1995     1     1     1     30    23
 7  1996     1     1     1     30     0
 8  1997     1     1     1     30     0
 9  1998     1     1     1     30     0
10  2000     1     1     1      0     2
11  2000     1     1     1     15     2
12  2000     1     1     1     30     8
13  2000     1     1     1     45     3
14  2000     1     1     2      0     4
15  2000     1     1     2     15     5

Thanks for your help Hlynur. Man could you give me an advise about how to solve this? https://stackoverflow.com/questions/48835464/how-to-manage-a-table-to-obtain-information-using-conditions — Felipe Rincón, Feb 17 '18 at 14:09

score 1 · Answer 4 · answered Jan 18 '18 at 06:15

You can coalesce the rain column to the average (without NA) by year, month, day:

library(data.table)
library(hutils)

setDT(A)[, 
         rain := coalesce(rain, mean(rain, na.rm = TRUE)),
         keyby = c("year", "month", "day")]

There is a nearly identical function from dplyr if you don't want to install hutils, though hutils::coalesce is faster:

 copy(A)[, `:=`(rain2, hutils::coalesce(rain, mean(rain, na.rm = TRUE))),      keyby = c("year", "month", "day")]
  copy(A)[, `:=`(rain2, dplyr::coalesce(rain, mean(rain, na.rm = TRUE))),      keyby = c("year", "month", "day")]
      min       lq     mean   median       uq      max neval cld
 107.1927 118.8372 132.8129 121.4076 128.0383  634.658   100  a 
 887.1115 921.6607 960.4542 943.0711 980.5143 1145.658   100   b

Generate sample file:

A <- 
  CJ(year = 1990:2000,
     month = 1:12,
     day = 1:31,  # yeah I know not all months have 31 days; doesn't matter here
     hour = 0:23,
     minute = 15L * c(0:3))
A[, rain := NA_real_]
A[sample.int(nrow(A), size = 0.8 * nrow(A)), rain := rexp(0.8 * nrow(A), rate = 2)]

Thanks for sharing the information of `coalesce` function from the `hutils` package. — www, Jan 18 '18 at 16:21

How to replace NA data of specific dates with the average data of different years of same dates of a dataframe in R?

4 Answers4