0

My data essentially looks like this, in a much shortened form:

df <- data.frame(id = c(1,1,1,2,2,2,2,2,3,3,3,3), 
                 height = c(150, NA, 151, NA, NA, 176, 175, 174, 198, NA, 197, 198))

What I would like to do is compute the mean height for each of these IDs and then plug that height in for every NA for that given ID. So ID 1 should have a mean height of 150.5, thus the first NA should be replaced by 150.5. Then ID 2 has a mean height of 175, so I'd like to plug in 175 for the two NAs associated with ID 2. And so on.

I know I could manually enter these with things like df[2,2] <- 150.5, but in reality I have thousands of IDs and this wouldn't be feasible.

I'm pretty comfortable with the dplyr package and I figure I should utilize group_by(id) somehow, but I can't figure out the rest.

Any suggestions?

Dan W
  • 19
  • 4

4 Answers4

3

An approach using replace_na

library(dplyr)
library(tidyr)

df %>% 
  mutate(height_new = replace_na(height, mean(height, na.rm=T)), .by = id)
   id height height_new
1   1    150   150.0000
2   1     NA   150.5000
3   1    151   151.0000
4   2     NA   175.0000
5   2     NA   175.0000
6   2    176   176.0000
7   2    175   175.0000
8   2    174   174.0000
9   3    198   198.0000
10  3     NA   197.6667
11  3    197   197.0000
12  3    198   198.0000
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
  • This works, thank you! Though I initially got an error that read "Can't convert from 'replace' to 'data' due to loss of precision", and a google search told me that R just gets confused if you try to convert an integer into a non-integer. In this case, it thought the heights were integers, and I was trying to plug in a decimal value. I just inserted the code ```df$height <- as.numeric(df$height)``` first, and that fixed the issue. – Dan W Aug 18 '23 at 15:29
  • 1
    DanW, what you experienced there is class-safe enforcement. While it might not seem important to distinguish between integers and floating-point, the other examples make a lot more sense. For example, `base::ifelse` [is not class safe](https://stackoverflow.com/q/6668963/3358272), leading to "fun" differences between (say) `ifelse(c(T,T), 1:2, c("A","B"))` (returns `1:2`) and `ifelse(c(T,F), 1:2, c("A","B"))` (returns `c("1", "B")`, very different). Enforcing class-consistency helps you from inadvertently changing the data. Most (all?) dplyr/tidyr functions enforce it. – r2evans Aug 18 '23 at 15:32
  • 1
    Thanks @r2evans, was about to write something similar! It's a feature not a bug :) – Andre Wildberg Aug 18 '23 at 15:37
  • 1
    To be complete: `data.table` also enforces class-consistency; `purrr` requires one to be explicit in class, e.g., `map` vs `map_dbl` vs `map_chr`; `ggplot2` does _not_ tend to complain b/w int/dbl, but will complain about number/string, as it should, but many base plots often just warn with `xy.coords(x, y) : NAs introduced by coercion`. I'm not saying one should migrate away from base R for this benefit, but data-wranglers should always be aware of things happening to their data, including casting/coercing from one class to another, be aware of what tools do it silently. – r2evans Aug 18 '23 at 15:51
2

Using dpylr::coalesce and a group_by or the new .by argument you could do

library(dplyr, warn = FALSE)

df |>
  mutate(
    height = coalesce(height, mean(height, na.rm = TRUE)),
    .by = id
  )
#>    id   height
#> 1   1 150.0000
#> 2   1 150.5000
#> 3   1 151.0000
#> 4   2 175.0000
#> 5   2 175.0000
#> 6   2 176.0000
#> 7   2 175.0000
#> 8   2 174.0000
#> 9   3 198.0000
#> 10  3 197.6667
#> 11  3 197.0000
#> 12  3 198.0000
stefan
  • 90,330
  • 6
  • 25
  • 51
2

Base R:

df$height2 <- with(df, ave(height, id, FUN = function(z) ifelse(is.na(z), mean(z, na.rm = TRUE), z)))
df
#    id height  height2
# 1   1    150 150.0000
# 2   1     NA 150.5000
# 3   1    151 151.0000
# 4   2     NA 175.0000
# 5   2     NA 175.0000
# 6   2    176 176.0000
# 7   2    175 175.0000
# 8   2    174 174.0000
# 9   3    198 198.0000
# 10  3     NA 197.6667
# 11  3    197 197.0000
# 12  3    198 198.0000

I wrote to height2 so you could easily compare the two, feel free to overwrite df$height instead.

ave is a group-calculation function: it takes a vector of values (height), one or more grouping variables (id), and a function. The function is passed one vector for each group, so the first time it is called, z is c(150, NA, 151), second time it is c(NA, NA, 176, 175, 174). Our function then just does a conditional replacement of the NA values.

An alternative to the ifelse is:

FUN = function(z) replace(z, is.na(z), mean(z, na.rm = TRUE))

The only time this will run into a glitch is if all height within an id are NA (not present here); the result will neither warn nor error, but it will return NaN (not a number).

r2evans
  • 141,215
  • 6
  • 77
  • 149
1

Another option is to use na.aggregate from zoo, which you can use in a dplyr workflow. mean is the default function in na.aggregate.

library(dplyr)
library(zoo)

df %>%
  group_by(id) %>%
  mutate(height = na.aggregate(height))

Output

   id   height
1   1 150.0000
2   1 150.5000
3   1 151.0000
4   2 175.0000
5   2 175.0000
6   2 176.0000
7   2 175.0000
8   2 174.0000
9   3 198.0000
10  3 197.6667
11  3 197.0000
12  3 198.0000

You can also use na.aggregate with data.table:

library(data.table)
setDT(df)

df[, height := na.aggregate(height), by=id]
AndrewGB
  • 16,126
  • 5
  • 18
  • 49