7

The question replace NA in a dplyr chain results into the solution

dt %.% group_by(a) %.% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))

with dplyr. I want to impute all colums with dplyr chain. There is no single column to group by, rather I want all numeric columns to have all NAs replaced by the means such as column means.

What is the most elegant way to replace all NAs with column means with tidyverse/dp?

hhh
  • 50,788
  • 62
  • 179
  • 282
  • 4
    Use `dt %>% group_by(a) %>% mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE), .))` If you want elegance, then `dt %>% group_by(a) %>% mutate_all(zoo::na.aggregate)` – akrun Jan 02 '18 at 10:17
  • @akrun what about if you continue with long dplyr chain after the `group_by(a)`? And what is the `a`? Do you have to use `ungroup` after the `group_by`? – hhh Jan 02 '18 at 10:19
  • 1
    It depends on the operation you want afterwards. Sometimes, I use `ungroup` because `spread` etc may not work with group columns – akrun Jan 02 '18 at 10:20
  • 1
    If you are using `rowwise` sum use the `rowSums` i.e. `%>% ungroup %>% mutate(newSum = rowSums(.[columnsofinterest]))` – akrun Jan 02 '18 at 10:24
  • 2
    Possible duplicate of [na.locf using group\_by from dplyr](https://stackoverflow.com/questions/43212308/na-locf-using-group-by-from-dplyr) – h3rm4n Jan 02 '18 at 10:40
  • It would be useful for downvoters to explain their votes so I could clarify the question as needed. I thank akrun for the answer where the `zoo::na.aggregate()` works very well :) – hhh Jan 02 '18 at 10:45
  • 1
    Related: [*Handle Continous Missing values in time-series data*](https://stackoverflow.com/q/32694313/2204410) – Jaap Jan 02 '18 at 11:22

1 Answers1

9

We can use mutate_all with ifelse

dt %>%
   group_by(a) %>% 
   mutate_all(funs(ifelse(is.na(.), mean(., na.rm = TRUE), .)))

If we want a compact option, then use the na.aggregate from zoo which by default replace NA values with mean

dt %>% 
   group_by(a) %>% 
   mutate_all(zoo::na.aggregate)

If we don't have a grouping variable, then remove the group_by and use mutate_if (just to be cautious about having some non-numeric column)

dt %>%
   mutate_if(is.numeric, zoo::na.aggregate)

If all the columns are numeric, even

zoo::na.aggregate(dt)

data

set.seed(42)
dt <- data.frame(a = rep(letters[1:3], each = 3),
                 b= sample(c(NA, 1:5), 9, replace = TRUE), 
                 c = sample(c(NA, 1:3), 9, replace = TRUE))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @hhh The `a` is from your code only.. If you have a different grouping variable, use that one – akrun Jan 02 '18 at 10:26
  • @hhh Isn't that just a row number column? I thought you wanted to group by some column and then do the replacement based on it – akrun Jan 02 '18 at 10:30
  • 1
    @hhh If it is a row number, it is just a unique element. So, You don't need any `group_by`, just do `dt %>% mutate_all(zoo::na.aggregate))` or apply the `zoo::na.aggregate(dt)` on the entire dataset – akrun Jan 02 '18 at 10:33
  • 1
    Thank you, they worked `zoo::na.aggregate(data.frame(a=c(1,2,NaN),b=c(10,NaN,12)))` and `data.frame(a=c(1,2,NaN),b=c(10,NaN,12)) %>% zoo::na.aggregate()`. – hhh Jan 02 '18 at 10:37