13

I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.

library(dplyr)
library(tidyr)

df <- data.frame(id=c(1,1,2,2,3,3),
                 email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)

This results in:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3 joe@email.com
6     3 joe@email.com

I expect it to be:

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3 NA
6     3 NA

The reason I expect it to be the latter is because of group_by's documentation saying, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, it's pretty clearly NOT doing that.


And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.


UPDATE @aosmith pointed out this open issue on Github. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.

doicomehereoften1
  • 537
  • 1
  • 4
  • 12
  • 1
    Looks like there is an [open issue](https://github.com/hadley/tidyr/issues/129) about this on the github repository – aosmith Dec 29 '15 at 19:19
  • If your question is how to do this in dplyr without `fill` (which doesn't currently appear to respect grouping), there are duplicates [here](http://stackoverflow.com/questions/27207162/fill-in-na-based-on-the-last-non-na-value-for-each-group-in-r) and [here](http://stackoverflow.com/questions/23340150/using-dplyr-window-functions-to-make-trailing-values) – aosmith Dec 29 '15 at 19:41
  • Thanks for the Github issue link! I did end up making a workaround using `ddply()` with `fill()`, but I guess the fact that the issue exists means that a proper solution will only come as a resolution to that issue. – doicomehereoften1 Dec 29 '15 at 20:05
  • As a work-around that doesn't require `zoo`, do see [Wojciech's answer over here](http://stackoverflow.com/a/27461663/903061). – Gregor Thomas Dec 29 '15 at 22:25

6 Answers6

11

Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.

df %>% group_by(id) %>% fill(email)

Source: local data frame [6 x 2]
Groups: id [3]

     id         email
  (dbl)        (fctr)
1     1 bob@email.com
2     1 bob@email.com
3     2 joe@email.com
4     2 joe@email.com
5     3            NA
6     3            NA
aosmith
  • 34,856
  • 9
  • 84
  • 118
8

Luckily you can still use zoo::na.locf for this:

df %>% 
    group_by(id) %>% 
    mutate(email = zoo::na.locf(email, na.rm = FALSE))  
# Source: local data frame [6 x 2]
# Groups: id [3]
# 
#      id         email
#   (dbl)        (fctr)
# 1     1 bob@email.com
# 2     1 bob@email.com
# 3     2 joe@email.com
# 4     2 joe@email.com
# 5     3            NA
# 6     3            NA
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • 1
    Good idea, but I and my coworkers consider it poor practice to use a package whose primary purpose is time series analysis for data normalization. If this were a time series problem it would be fair game, but not here. – doicomehereoften1 Dec 29 '15 at 20:00
2

Another option is to use do from dplyr:

df3 <- df %>% group_by(id) %>% do(fill(.,email))
Erwan LE PENNEC
  • 516
  • 3
  • 10
  • This is probably the best workaround listed. However, I'm still going to stand by my note above about it not being properly answered until there's a fix to the package. Thanks! – doicomehereoften1 Dec 30 '15 at 17:02
0

Two questions, does it has be duplicated and do you have to use dplyr and tidyr?

Maybe this could be a solution?

(
bar <- data.frame(id=c(1,1,2,2,3,3),
                 email=c('bob@email.com', NA, 'joe@email.com', NA, NA, NA))
)                 
#> id         email
#>  1 bob@email.com
#>  1          <NA>
#>  2 joe@email.com
#>  2          <NA>
#>  3          <NA>
#>  3          <NA>

(                 
foo <- bar[!duplicated(bar$id),]
)
#> id         email
#>  1 bob@email.com
#>  2 joe@email.com
#>  3          <NA>
Eric Fail
  • 8,191
  • 8
  • 72
  • 128
  • 2
    There's more data to the right in the actual data that I'm working on, I just wanted to illustrate the problem in the most terse manner possilbe. – doicomehereoften1 Dec 29 '15 at 19:58
0

This is kind of ugly, but it is another option that uses dplyr and works with your sample data

df %>%
   group_by(id) %>%
   mutate(email = email[ !is.na(email) ][1])
steveb
  • 5,382
  • 2
  • 27
  • 36
0

I have come across this issue quite a few times, I do worry about using this..

df2 <- df %>% group_by(id) %>% fill(email)

on large data sets as I have had mixed results and found the following work around. The split function used with map_df ensures you apply whatever you are doing to the a specific df for each id and map_df then re binds all the individual df like magic. It has also proved handy in lots of other circumstances. Somewhat obsolete now this issue has been fixed but still a useful alternative that avoids group_by().

df %>% split(.$id) %>% map_df(function(x){ x %>% fill(email)})

hammoire
  • 341
  • 1
  • 2
  • 10