41

Question has been edited from the original.

After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:

Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID         G
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        NA

The following does not work as I expected

library(dplyr)
library(Lahman)

df <- Batting[ c("yearID", "teamID", "G") ]
df <- group_by(df, teamID )
df$G[is.na(df$G)] <- mean(df$G, na.rm = TRUE)

Source: local data frame [20 x 3] Groups: yearID, teamID

   yearID teamID         G
1    2004    SFN  11.00000
2    2006    CHN  43.00000
3    2007    CHA   2.00000
4    2008    BOS   5.00000
5    2009    SEA   3.00000
6    2010    SEA   4.00000
7    2012    NYA  **49.07894**

> mean(Batting$G_battin, na.rm = TRUE)
[1] **49.07894**

In fact it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R also does not work as it imputed the overall mean and not the group mean. Also this approach converts the data to a regular dat. a frame. Is there a better way to do this?

df %.% 
  group_by( yearID ) %.%
  transform(G = ifelse(is.na(G), 
    mean(G, na.rm = TRUE), 
    G)
  )

Edit: Replacing transform with mutate gives the following error

Error in mutate_impl(.data, named_dots(...), environment()) : 
  INTEGER() can only be applied to a 'integer', not a 'double'

Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.

df %.% 
  group_by( teamID ) %.%
  mutate(G = ifelse(is.na(G), as.integer(mean(G, na.rm = TRUE)), G))

Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID         G
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        47

> mean_NYA <- mean(filter(df, teamID == "NYA")$G, na.rm = TRUE)
> as.integer(mean_NYA)
[1] 47

Edit: Following up on @Romain's comment I installed dplyr from github:

> head(df,10)
   yearID teamID         G
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        NA
8    1954    ML1       122
9    1955    ML1       153
10   1956    ML1       153

> df %.% 
+   group_by(teamID)  %.%
+   mutate(G = ifelse(is.na(G), mean(G, na.rm = TRUE), G))
Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID          G
1    2004    SFN          0
2    2006    CHN          0
3    2007    CHA          0
4    2008    BOS          0
5    2009    SEA          0
6    2010    SEA 1074266112
7    2012    NYA   90693125
8    1954    ML1        122
9    1955    ML1        153
10   1956    ML1        153
..    ...    ...        ...

So I didn't get the error (good) but I got a (seemingly) strange result.

Mark
  • 7,785
  • 2
  • 14
  • 34
Vincent
  • 5,063
  • 3
  • 28
  • 39
  • 1
    The error message is confusing but the root of the problem is the confusing semantics of `ifelse`. I've added an issue to make sure I think about it more https://github.com/hadley/dplyr/issues/254 – hadley Feb 12 '14 at 14:59
  • I cannot reproduce the error with the devel version of dplyr. – Romain Francois Feb 12 '14 at 15:13
  • Thanks hadley. @Romain, thanks for the suggestion. I installed the master branch from hadley/dplyr and get the results above. Different from your result in issues #254 – Vincent Feb 12 '14 at 17:55

2 Answers2

33

The main issue you're having is that mean returns a double while the G column is an integer. So wrapping the mean in as.integer would work, or you'd need to convert the entire column to numeric I guess.

That said, here are a couple of data.table alternatives - I didn't check which one is faster.

library(data.table)

# using ifelse
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b := ifelse(is.na(b), mean(b, na.rm = T), b), by = a]

# using a temporary column
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b.mean := mean(b, na.rm = T), by = a][is.na(b), b := b.mean][, b.mean := NULL]

And this is what I'd want to do ideally (there is an FR about this):

# again, atm this is pure fantasy and will not work
dt[, b[is.na(b)] := mean(b, na.rm = T), by = a]

The dplyr version of the ifelse is (as in OP):

dt %>% group_by(a) %>% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))

I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).

Mark
  • 7,785
  • 2
  • 14
  • 34
eddi
  • 49,088
  • 6
  • 104
  • 155
  • 1
    I think there's a NaN because all values corresponding to that `teamID, yearID` are `NA`. – Arun Feb 12 '14 at 07:15
  • Thanks for the data.table alternatives @eddi. At some point I really have to figure out how all the []s work (your temporary column option). – Vincent Feb 12 '14 at 17:21
  • 2
    `mutate` with `ifelse` was exactly what I needed as well. But can I replace all NA's in the entire data frame without explicitly naming the columns? – Bobby Oct 05 '16 at 13:28
  • @Bobby in `dplyr` there is some function that iterates through columns - I don't remember the name but should be easy to find. In `data.table` you'd do regular R looping: `dt[, names(dt) := lapply(.SD, function(x) ifelse(is.na(x), mean(x, na.rm = T), x)), by = a]` – eddi Oct 05 '16 at 15:40
  • In the original data frame syntax `df[is.na(df)] <- 0` replaces all NA values, no looping required, though generally replacing all NAs in all columns with the same thing would be an odd use case. – Matt Anthony Nov 27 '17 at 16:01
  • How can you impute all columns with the command`dt %.% group_by(a) %.% mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))` with any column name? – hhh Jan 02 '18 at 10:13
  • 7
    @Bobby, you can replace all `NA` in a df in `dplyr` with `df %>% mutate_all(.funs = funs(ifelse(is.na(.), 0, .)))` . Or a subset of columns with `df %>% mutate_at(.vars = vars(yearID, G_batting), .funs = funs(ifelse(is.na(.), 0, .)))` – user2739472 Mar 22 '18 at 11:52
0

Using the modern .by = argument, and the relatively modern if_else() function:

Batting |>
  select(yearID, teamID, G) |>
  mutate(G = if_else(is.na(G), G, mean(G, na.rm = TRUE)), .by = teamID)

Output:

# A tibble: 112,184 × 3
   yearID teamID     G
    <int> <fct>  <dbl>
 1   1871 TRO    20.2 
 2   1871 RC1    20.5 
 3   1871 CL1    18.4 
 4   1871 WS3    13.3 
 5   1871 RC1    20.5 
 6   1871 FW1     9.67
 7   1871 RC1    20.5 
 8   1871 BS1    45.5 
 9   1871 FW1     9.67
10   1871 BS1    45.5 
# ℹ 112,174 more rows
Mark
  • 7,785
  • 2
  • 14
  • 34