Use tidyverse to replace NA with mean of data, by group

Question

I'm attempting to write a function that replaces NA in a numeric data.frame column with the mean, by group, of the data that is present for that variable. I realise this is imputation and there are packages for that, would prefer to do this myself, and the mean is just an example, will use a more sophisticated function. I've attempted to produce a mwe, but I get stuck near the end. I'm trying, where possible to stick to using tidyverse methods.

library(tidyverse)
## First create a little dataset for a minimum working example for questions
## three vectors
id <- c(rep("boh1", 6), rep("boh2", 6), rep("boh3", 6), rep("boh4", 6))
operator <- rep(c("op1", "op2"), each = 12)
nummos <- c(1, 4, 4, 3, 1, NA, 4, 2, 2, 3, 4, 4, NA, 1, 1, 5,
                     5, 4, 5, 3, 2, NA, 3, 3)
## combine vectors into df
dat1 <- data.frame(id, operator, nummos)
## group by two variables and get mean of variable by group
dat2 <- dat1 %>%
    group_by(id, operator) %>%
    summarize(mean = mean(nummos, na.rm=TRUE))
## now stuck, how to replace NA by mean value appropriate for that group?

cbo · Accepted Answer · 2019-10-28T14:07:48.193

3

Use mutate and dplyr::case_when instead of summarise :

dat1 %>%
    group_by(id, operator) %>%
    mutate(nummos2 = case_when(is.na(nummos) ~ mean(nummos, na.rm=TRUE),
                               TRUE ~ as.numeric(nummos) 
                              )
           )

edited Oct 28 '19 at 14:07

answered Oct 28 '19 at 14:01

cbo

1,664
1
12
27

What's the benefit of `case_when` over base `ifelse` if you're only testing 1 condition? – camille Oct 28 '19 at 14:18
This is true I believe that both do work with the same performance. One can stay in `tidyverse` or use `base` here. – cbo Oct 28 '19 at 14:24
Appears to work nicely, Thx. – Jim Maas Oct 28 '19 at 14:27
Sure, but just because you're using other `dplyr` functions, that doesn't mean you shouldn't also use base functions, especially in cases where a base function is simpler or more appropriate (e.g. for checking 1 condition, rather than several, which is what `case_when` is designed for) – camille Oct 28 '19 at 14:31
@camille, thanks for this and it is very helpful. Can you please remove the comments above that suggest that it has been answered before? I found both of those answers and could not, with my intermediate ability, formulate any of the four very good solutions that have been volunteered. If we were all already experts at everything, we wouldn't be asking questions here. Happy to discuss, thanks, J. – Jim Maas Oct 30 '19 at 08:37
1

@JimMaas you're asking me to reopen the question? I can't. You can look at the [help section](https://stackoverflow.com/help/duplicates) for reference on duplicates. It's not necessarily a bad thing to have a question closed as a duplicate. – camille Oct 30 '19 at 13:39

score 3 · Answer 2 · answered Oct 28 '19 at 14:11

You can simply define your own function using replace(). Try out:

dat1 %>% 
        group_by(id, operator) %>% 
        mutate_at("nummos", function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
# output
# A tibble: 24 x 3
# Groups:   id, operator [4]
   id    operator nummos
   <fct> <fct>     <dbl>
 1 boh1  op1         1  
 2 boh1  op1         4  
 3 boh1  op1         4  
 4 boh1  op1         3  
 5 boh1  op1         1  
 6 boh1  op1         2.6
 7 boh2  op1         4  
 8 boh2  op1         2  
 9 boh2  op1         2  
10 boh2  op1         3  
# ... with 14 more rows

score 2 · Answer 3 · answered Oct 28 '19 at 14:03

I'm not really familiar with tidyverse, so here goes a data.table solution:

library(data.table) # load package
setDT(dat1) # convert data.frame to data.table

Now, I'll create a data.table with the mean of nummos by c(id, operator) and merge it with dat1, filling the NAs with the calculated values:

dat1[dat1[, mean(nummos, na.rm = TRUE), by = .(id, operator)], nummos := ifelse(is.na(nummos), i.V1, nummos), on = .(id, operator)]

The dat1[, mean(nummos, na.rm = TRUE), by = .(id, operator)] is a small data.table with the means by group.

The nummos := ifelse... part does de assignment only when nummos is NA.

dat1
      id operator nummos
 1: boh1      op1    1.0
 2: boh1      op1    4.0
 3: boh1      op1    4.0
 4: boh1      op1    3.0
 5: boh1      op1    1.0
 6: boh1      op1    2.6
 7: boh2      op1    4.0
 8: boh2      op1    2.0
 9: boh2      op1    2.0
10: boh2      op1    3.0
11: boh2      op1    4.0
12: boh2      op1    4.0
13: boh3      op2    3.2
14: boh3      op2    1.0
15: boh3      op2    1.0
16: boh3      op2    5.0
17: boh3      op2    5.0
18: boh3      op2    4.0
19: boh4      op2    5.0
20: boh4      op2    3.0
21: boh4      op2    2.0
22: boh4      op2    3.2
23: boh4      op2    3.0
24: boh4      op2    3.0
  id operator nummos

Jaap · Answer 4 · 2019-10-28T14:44:50.563

2

Another solution with data.table with the (pretty new) nafill-function:

library(data.table)
setDT(dat1)

dat1[, nummos := nafill(nummos, "const", fill = mean(nummos, na.rm = TRUE))
     , by = .(id, operator)]

And a tidyverse solution using na.aggregate from the zoo-package:

dat1 %>%
  group_by(id, operator) %>%
  mutate_at("nummos", zoo::na.aggregate)

edited Oct 28 '19 at 14:44

answered Oct 28 '19 at 14:31

Jaap

81,064
34
182
193

Jaap, how would this function know that I wanted to replace with the mean? Thx – Jim Maas Oct 28 '19 at 14:38
@JimMaas the default is mean; if you want to use another function you should specify that; see also the help-file: `?zoo::na.aggregate` – Jaap Oct 28 '19 at 14:46
Jaap, thanks for this and it is very helpful. Can you please remove the comments above that suggest that it has been answered before? I found both of those answers and could not, with my intermediate ability, formulate any of the four very good solutions that have been volunteered. If we were all already experts at everything, we wouldn't be asking questions here. Happy to discuss, thanks, J. – Jim Maas Oct 30 '19 at 08:38
@JimMaas I could, but imho this is a correct duplicate. – Jaap Oct 31 '19 at 06:35

Use tidyverse to replace NA with mean of data, by group

4 Answers4

Linked