How to replace NA with most recent non-NA by group?

Question

I have a data frame of individuals in which some characteristics are incomplete and repeated across rows, as follows:

    name <- c("A", "A", "B", "B", "B", "C", "D", "D")
    age <- c(28,NA,NA,NA,NA,NA,53,NA)
    birthplace <- c("city1",NA, "city2",NA,NA,NA,NA,NA)
    value <- 100:107
    df <- data.frame(name,age,birthplace,value)

    name age birthplace value
1    A  28      city1   100
2    A  NA       <NA>   101
3    B  NA      city2   102
4    B  NA       <NA>   103
5    B  NA       <NA>   104
6    C  NA       <NA>   105
7    D  53       <NA>   106
8    D  NA       <NA>   107

Since value is unique for each row, I want to complete each row with the available details for that person, like this:

       name age birthplace value
    1    A  28      city1   100
    2    A  28      city1   101
    3    B  NA      city2   102
    4    B  NA      city2   103
    5    B  NA      city2   104
    6    C  NA       <NA>   105
    7    D  53       <NA>   106
    8    D  53       <NA>   107

I tried to use

library(zoo)
library(dplyr)
df <- df %>% group_by(name) %>% na.locf(na.rm=F)

But it doesn't work very well. Any idea how to apply the function by group?

@alistaire the question you point to asks for a dplyr solution (even if the answers stray from that), whereas here that constraint is not specified. — Martin Morgan, Aug 21 '16 at 16:57
@MartinMorgan The question yes, but not the answers, which cover base, zoo alone, data.table, etc. There's no functional difference in the answers; dplyr is just the grammar used in the question. — alistaire, Aug 21 '16 at 17:00

Martin Morgan · Accepted Answer · 2016-08-22T09:12:49.787

As another base R solution, here is a poor man's na.locf

fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}
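
To see why the indexing works, the sketch below traces `keep` and `cumsum(keep)` on a small vector. The sample vector `v` is made up for illustration and is not from the question.

```r
fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}

v <- c(NA, 1, NA, NA, 2, NA)      # illustrative input (assumption)

keep <- c(TRUE, !is.na(v[-1]))    # always keep element 1, then every non-NA
keep
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

cumsum(keep)                      # position of the most recent kept element
## [1] 1 2 2 2 3 3

fill_down(v)                      # a leading NA stays NA; later NAs are filled
## [1] NA  1  1  1  2  2
```

Because the first element is always kept, `cumsum(keep)` never yields index 0, so a leading NA is preserved rather than dropped.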

To fill down by group, the approach is to use tapply() to split the vector and apply the function to each group, and `split<-` to put the groups back in their original positions, as

fill_down_by_group <- function(v, grp) {
    ## original 'by hand':
    ##     split(v, grp) <- tapply(v, grp, fill_down)
    ##     v
    ## done by built-in function `ave()`
    ave(v, grp, FUN=fill_down)
}

To process multiple columns, one might

elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)
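
As a check, here is a self-contained sketch that runs this approach on the data from the question. It assumes `stringsAsFactors = FALSE` so that birthplace stays a character column, and it writes to a copy named `filled` (a name chosen here for illustration).

```r
df <- data.frame(
    name = c("A", "A", "B", "B", "B", "C", "D", "D"),
    age = c(28, NA, NA, NA, NA, NA, 53, NA),
    birthplace = c("city1", NA, "city2", NA, NA, NA, NA, NA),
    value = 100:107,
    stringsAsFactors = FALSE      # assumption: keep birthplace as character
)

fill_down <- function(v) {
    if (length(v) > 1) {
        keep <- c(TRUE, !is.na(v[-1]))
        v[keep][cumsum(keep)]
    } else v
}

elts <- c("age", "birthplace")
filled <- df                      # work on a copy, leave df untouched
filled[elts] <- lapply(df[elts], function(v) ave(v, df$name, FUN = fill_down))

filled$age
## [1] 28 28 NA NA NA NA 53 53
filled$birthplace
## [1] "city1" "city1" "city2" "city2" "city2" NA      NA      NA
```

Groups with no observed value (B's age, C, D's birthplace) stay NA, and the value column is unchanged.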

Notes

I would be interested in seeing how a dplyr solution handles many columns without hard-coding each one. Answering my own question, I guess this is
```
library(dplyr); library(tidyr)
df %>% group_by(name) %>% fill(all_of(elts))
```

A more efficient base solution when the rows are already ordered by group (e.g., identical(grp, sort(grp))) is

fill_down_by_grouped <- function(v, grp) {
    if (length(v) > 1) {
        keep <- !(duplicated(grp) & is.na(v))
        v[keep][cumsum(keep)]
    } else v
}
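
The sketch below illustrates the ordered-groups idea, assuming the first row of each group is detected with `duplicated(grp)`. It also shows why the ordering matters: when groups are interleaved, a value can leak from one group into another. The interleaved input is made up for illustration.

```r
fill_down_by_grouped <- function(v, grp) {
    if (length(v) > 1) {
        # keep the first row of every group, plus every non-NA value
        keep <- !(duplicated(grp) & is.na(v))
        v[keep][cumsum(keep)]
    } else v
}

grp <- c("A", "A", "B", "B", "B", "C", "D", "D")
age <- c(28, NA, NA, NA, NA, NA, 53, NA)

fill_down_by_grouped(age, grp)    # rows ordered by group: correct
## [1] 28 28 NA NA NA NA 53 53

# Interleaved groups (illustrative): the NA belonging to "A" picks up B's value
fill_down_by_grouped(c(1, 2, NA), c("A", "B", "A"))
## [1] 1 2 2
```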

For me, fill_down() on a vector of about 10M elements takes ~225 ms; fill_down_by_grouped() takes ~300 ms regardless of the number of groups; fill_down_by_group() scales with the number of groups: ~2 s for 10,000 groups and about 36 s for 10M groups.

This is the first time I've ever seen `split<-`. Really good stuff. — Pierre L, Aug 21 '16 at 11:48
I didn't read the OP, but it looks like `ave` is an alternative to `*_by_group` here: `lapply(df[elts], function(x) ave(x, df$name, FUN = fill_down))` ? — Frank, Aug 21 '16 at 17:23
@Frank yes, `ave()` is a good bet that I keep forgetting about, thanks. — Martin Morgan, Aug 21 '16 at 18:19
Thanks for the solution. It works, however, a little bit exaggerated to solve the problem with two functions, isn't it? — Lingyu Kong, Aug 22 '16 at 03:21

score 3 · Answer 2 · answered Aug 21 '16 at 14:22

Could also be:

library(dplyr)
library(tidyr)
df %>% group_by(name) %>% fill(age, birthplace)

# Source: local data frame [8 x 4]
# Groups: name [4]

#     name   age birthplace value
#   <fctr> <dbl>     <fctr> <int>
# 1      A    28      city1   100
# 2      A    28      city1   101
# 3      B    NA      city2   102
# 4      B    NA      city2   103
# 5      B    NA      city2   104
# 6      C    NA         NA   105
# 7      D    53         NA   106
# 8      D    53         NA   107

Psidom

Handy: `fill(everything())` – alistaire Aug 21 '16 at 15:35
@alistaire As always, a more concise answer. – Psidom Aug 21 '16 at 15:40

score 2 · Answer 3 · answered Aug 21 '16 at 10:45

You can wrap the na.locf in do

df %>% group_by(name) %>% do(na.locf(., na.rm = FALSE))

Richard Telford

`do()` coerces to character; maybe `mutate(age=na.locf(age, na.rm=FALSE), birthplace=na.locf(birthplace, na.rm=FALSE))` – Martin Morgan Aug 21 '16 at 10:53
I think we should use this `df %>% group_by(name) %>% mutate_each(funs(na.locf(.,na.rm = FALSE)))` – user2100721 Aug 21 '16 at 13:12
The new version: `df %>% group_by(name) %>% mutate_all(zoo::na.locf, na.rm = FALSE)` Or just use `tidyr::fill` like Psidom's approach. – alistaire Aug 21 '16 at 15:32

score 2 · Answer 4 · answered Aug 21 '16 at 10:54

Depending upon what you are doing next, you may prefer the data in a nested form.

(nested <- df %>% 
  group_by(name) %>% 
  summarize(
    age = na.omit(age)[1], 
    birthplace = na.omit(birthplace)[1], 
    value = list(value)
  )
)
## # A tibble: 4 x 4
##     name   age birthplace     value
##   <fctr> <dbl>     <fctr>    <list>
## 1      A    28      city1 <int [2]>
## 2      B    NA      city2 <int [3]>
## 3      C    NA         NA <int [1]>
## 4      D    53         NA <int [2]>

If you need to compute on individual values, you can always unnest it later.

nested %>% tidyr::unnest(value)
## # A tibble: 8 x 4
##     name   age birthplace value
##   <fctr> <dbl>     <fctr> <int>
## 1      A    28      city1   100
## 2      A    28      city1   101
## 3      B    NA      city2   102
## 4      B    NA      city2   103
## 5      B    NA      city2   104
## 6      C    NA         NA   105
## 7      D    53         NA   106
## 8      D    53         NA   107

Abdou · Answer 5 · 2016-08-21T10:59:52.623

This is a base R solution:

do.call(rbind,lapply(split(df, df$name), function(x) {
    tempdf <- x
    if (nrow(tempdf) > length(which(is.na(x$birthplace)))) {
        tempdf[which(is.na(x$birthplace)),c("age","birthplace")] <- tempdf[which(is.na(x$birthplace))[1]-1,c("age","birthplace")]
    }
    return(tempdf)
}))

Output:

 name age birthplace value
 A    28  city1      100  
 A    28  city1      101  
 B    NA  city2      102  
 B    NA  city2      103  
 B    NA  city2      104  
 C    NA  <NA>       105  
 D    53  <NA>       106  
 D    NA  <NA>       107

G. Grothendieck · Answer 6 · 2016-08-22T17:51:29.593

Here is a base R solution. The fill function invokes ave using na.omit(x)[1] as in Richie Cotton's solution.

fill <- function(...) ave(..., FUN = function(x) na.omit(x)[1])
transform(df, birthplace = fill(birthplace, name), age = fill(age, name))

Note: This also works with na.locf. Replace fill with:

library(zoo)
fill <- function(...) ave(..., FUN = function(x) na.locf(x, na.rm = FALSE))
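
The two variants agree on the question's data but differ in general: `na.omit(x)[1]` spreads the first non-NA value over the whole group, while last-observation-carried-forward only fills downward from the most recent value. A minimal base R sketch of the difference follows; `first_non_na` and `locf` are illustrative helper names, and `locf` is a hand-written stand-in assumed to behave like `zoo::na.locf(x, na.rm = FALSE)` on a plain vector.

```r
first_non_na <- function(x) na.omit(x)[1]

locf <- function(x) {             # stand-in for zoo::na.locf(x, na.rm = FALSE)
    keep <- c(TRUE, !is.na(x[-1]))
    x[keep][cumsum(keep)]
}

x <- c(NA, 1, NA, 2, NA)          # illustrative values, one group
g <- rep("a", 5)

ave(x, g, FUN = first_non_na)     # every row gets the first observed value
## [1] 1 1 1 1 1

ave(x, g, FUN = locf)             # leading NA kept, later values preserved
## [1] NA  1  1  2  2
```

Which one is appropriate depends on whether the attribute is constant within a group (as age and birthplace are here) or can change from row to row.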

edited Aug 22 '16 at 17:51

answered Aug 21 '16 at 23:59

score 0 · Answer 7 · answered Aug 21 '16 at 17:42

You could also do this through a merge: join the table to itself on the name column, group by value, and take the non-NULL age and birthplace within each group.

library(sqldf)
sqldf('select t1.name, max(t2.age) as age, max(t2.birthplace) as birthplace, t1.value
       from df t1 inner join df t2 on t1.name = t2.name
       group by t1.value, t1.name')

Chirayu Chamoli

score 0 · Answer 8 · answered Aug 22 '16 at 16:03

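Consider also a nested apply base solution running a rolling head() for each column. It takes the first value seen in each group, so it assumes the non-NA value comes first within the group:

cols <- c("age", "birthplace")   # fill only these columns; value must stay unique per row
df[cols] <- lapply(cols, function(d)
    sapply(seq_len(nrow(df)), function(i)
        head(df[seq_len(i), d][df$name[seq_len(i)] == df$name[i]], 1)))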
