I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.
library(dplyr)
library(tidyr)
df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 email = c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)
This results in:
Source: local data frame [6 x 2]
Groups: id [3]
     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3 joe@email.com
6     3 joe@email.com
I expect it to be:
Source: local data frame [6 x 2]
Groups: id [3]
     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3            NA
6     3            NA
The reason I expect the latter is group_by's documentation, which says, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, fill is pretty clearly NOT being applied by group: the NAs for id 3 are filled with the last email from id 2.
And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.
UPDATE: @aosmith pointed out this open issue on GitHub. I'm going to say that there won't be a proper solution to this problem until that issue is resolved; everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.
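For reference, here is a minimal sketch of one such workaround. It assumes that mutate() is evaluated per group on a grouped tbl. The helper locf() is a hypothetical name of my own, not part of dplyr or tidyr, and the email column is kept as character to avoid factor handling.

```r
library(dplyr)

# Hypothetical helper (not from dplyr or tidyr): carry the last non-NA value
# forward within a single vector, leaving leading NAs untouched.
locf <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the most recent non-NA value so far
  idx[idx == 0] <- NA        # nothing observed yet: keep NA
  x[!is.na(x)][idx]
}

df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 email = c('bob@email.com', NA, 'joe@email.com', NA, NA, NA),
                 stringsAsFactors = FALSE)

# mutate() runs locf() separately for each id, so values never cross groups
df2 <- df %>% group_by(id) %>% mutate(email = locf(email))
```

With this, the two rows for id 3 stay NA because locf() only ever sees one group's values at a time. It sidesteps fill() rather than fixing it, so it is only a stopgap.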