group_by() into fill() not working as expected

Question

I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.

library(dplyr)
library(tidyr)

df <- data.frame(id=c(1,1,2,2,3,3),
                 email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)

This results in:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3 joe@email.com
6     3 joe@email.com

I expect it to be:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3 NA
6     3 NA

The reason I expect it to be the latter is because of group_by's documentation saying, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, it's pretty clearly NOT doing that.

And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.

UPDATE @aosmith pointed out this open issue on Github. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.

Looks like there is an [open issue](https://github.com/hadley/tidyr/issues/129) about this on the github repository — aosmith, Dec 29 '15 at 19:19
If your question is how to do this in dplyr without `fill` (which doesn't currently appear to respect grouping), there are duplicates [here](http://stackoverflow.com/questions/27207162/fill-in-na-based-on-the-last-non-na-value-for-each-group-in-r) and [here](http://stackoverflow.com/questions/23340150/using-dplyr-window-functions-to-make-trailing-values) — aosmith, Dec 29 '15 at 19:41
Thanks for the Github issue link! I did end up making a workaround using `ddply()` with `fill()`, but I guess the fact that the issue exists means that a proper solution will only come as a resolution to that issue. — doicomehereoften1, Dec 29 '15 at 20:05
As a work-around that doesn't require `zoo`, do see [Wojciech's answer over here](http://stackoverflow.com/a/27461663/903061). — Gregor Thomas, Dec 29 '15 at 22:25

score 11 · Accepted Answer · answered Dec 30 '15 at 23:07

11

Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.

df %>% group_by(id) %>% fill(email)

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3            NA
6     3            NA

answered Dec 30 '15 at 23:07

aosmith

34,856
9
84
118

Indeed, this is fixed in the new 0.4.0 version available in CRAN. – Erwan LE PENNEC Feb 03 '16 at 10:24

score 8 · Answer 2 · answered Dec 29 '15 at 19:36

8

Luckily you can still use zoo::na.locf for this:

df %>% 
    group_by(id) %>% 
    mutate(email = zoo::na.locf(email, na.rm = FALSE))  
# Source: local data frame [6 x 2]
# Groups: id [3]
# 
#      id         email
#   (dbl)        (fctr)
# 1     1 bob@email.com
# 2     1 bob@email.com
# 3     2 joe@email.com
# 4     2 joe@email.com
# 5     3            NA
# 6     3            NA

answered Dec 29 '15 at 19:36

Gregor Thomas

136,190
20
167
294

1

Good idea, but I and my coworkers consider it poor practice to use a package whose primary purpose is time series analysis for data normalization. If this were a time series problem it would be fair game, but not here. – doicomehereoften1 Dec 29 '15 at 20:00

score 2 · Answer 3 · answered Dec 30 '15 at 09:42

2

Another option is to use do from dplyr:

df3 <- df %>% group_by(id) %>% do(fill(.,email))

answered Dec 30 '15 at 09:42

Erwan LE PENNEC

516
3
10

This is probably the best workaround listed. However, I'm still going to stand by my note above about it not being properly answered until there's a fix to the package. Thanks! – doicomehereoften1 Dec 30 '15 at 17:02

score 0 · Answer 4 · answered Dec 29 '15 at 19:20

Two questions, does it has be duplicated and do you have to use dplyr and tidyr?

Maybe this could be a solution?

(
bar <- data.frame(id=c(1,1,2,2,3,3),
                 email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
)                 
#> id         email
#>  1 bob@email.com
#>  1          <NA>
#>  2 joe@email.com
#>  2          <NA>
#>  3          <NA>
#>  3          <NA>

(                 
foo <- bar[!duplicated(bar$id),]
)
#> id         email
#>  1 bob@email.com
#>  2 joe@email.com
#>  3          <NA>

There's more data to the right in the actual data that I'm working on, I just wanted to illustrate the problem in the most terse manner possilbe. — doicomehereoften1, Dec 29 '15 at 19:58

score 0 · Answer 5 · answered Dec 30 '15 at 08:04

0

This is kind of ugly, but it is another option that uses dplyr and works with your sample data

df %>%
   group_by(id) %>%
   mutate(email = email[ !is.na(email) ][1])

answered Dec 30 '15 at 08:04

steveb

5,382
2
27
36

score 0 · Answer 6 · answered Feb 12 '20 at 16:22

I have come across this issue quite a few times, I do worry about using this..

df2 <- df %>% group_by(id) %>% fill(email)

on large data sets as I have had mixed results and found the following work around. The split function used with map_df ensures you apply whatever you are doing to the a specific df for each id and map_df then re binds all the individual df like magic. It has also proved handy in lots of other circumstances. Somewhat obsolete now this issue has been fixed but still a useful alternative that avoids group_by().

df %>% split(.$id) %>% map_df(function(x){ x %>% fill(email)})

group_by() into fill() not working as expected

6 Answers6

Linked

Related