Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable

Question

I want to keep distinct rows on a data frame, with an algorithm that chooses the last value per group (as dplyr::distinct() does by default), but only if it's not NA. I've seen this great answer on SO that relies on data.table, but I can't scale it to data with more than one grouping variable.

To demonstrate the problem, I start with the minimal example that does work, then scale it up. So first, consider the following data:

library(tibble)

df_id_and_type <-
  tibble::tribble(
        ~id, ~type,
          1,   "A",
          1,    NA,
          2,   "B",
          3,   "A",
          3,    NA,
          3,   "D",
          3,    NA,
          4,    NA,
          4,   "C",
          5,   "A",
          6,    NA,
          6,   "B",
          6,    NA
        )

I want to get the distinct type values per id, by choosing the last value unless it's NA. If the last is NA then go up until there's non-NA. So this answer shows us how to do it with data.table:

library(data.table)

dt_id_and_type        <- as.data.table(df_id_and_type)
dt_id_and_type$typena <- is.na(dt_id_and_type$type)
setorderv(dt_id_and_type, c("typena","id"), order = c(-1, 1))
dt_id_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE]
#>    id type
#> 1:  1    A
#> 2:  2    B
#> 3:  3    D
#> 4:  4    C
#> 5:  5    A
#> 6:  6    B

But what to do if we have more than one grouping variable (i.e, not only id)? In the following example I add a year variable:

df_id_year_and_type <-
  df_id_and_type %>%
  add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012), 
             .before = "type")

df_id_year_and_type
#> # A tibble: 13 x 3
#>       id  year type 
#>    <dbl> <dbl> <chr>
#>  1     1  2002 A    
#>  2     1  2002 <NA> 
#>  3     2  2008 B    
#>  4     3  2010 A    
#>  5     3  2010 <NA> 
#>  6     3  2010 D    
#>  7     3  2013 <NA> 
#>  8     4  2020 <NA> 
#>  9     4  2020 C    
#> 10     5  2009 A    
#> 11     6  2010 <NA> 
#> 12     6  2010 B    
#> 13     6  2012 <NA>

My expected output would be:

## # A tibble: 8 x 3
##      id  year type 
##   <dbl> <dbl> <chr>
## 1     1  2002 A    
## 2     2  2008 B    
## 3     3  2010 D    
## 4     3  2013 NA   # for id 3 in year 2013 there was only `NA`, so that's what we get
## 5     4  2020 C    
## 6     5  2009 A    
## 7     6  2010 B    
## 8     6  2012 NA   # same as comment above

Any idea how I could scale the solution that worked in 1-grouping-var case to the current data? The first 2 lines of code are no-brainer:

dt_id_year_and_type        <- as.data.table(df_id_year_and_type)
dt_id_year_and_type$typena <- is.na(dt_id_year_and_type$type)
setorderv(dt_id_year_and_type, c("typena","id"), order = c(-1, 1)) # <--- how to account for `year`?
dt_id_year_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE] # <--- here too...

score 2 · Accepted Answer · answered Nov 28 '21 at 15:39

Here some data.table-based solutions.

setDT(df_id_year_and_type)

method 1

na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type. unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups. And by joining them (using the last match: mult="last"), we obtain the desired output.

na.omit(df_id_year_and_type, cols="type"
        )[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE), 
          on=c('id', 'year'), 
          mult="last"]

#       id  year   type
#    <num> <num> <char>
# 1:     1  2002      A
# 2:     2  2008      B
# 3:     3  2010      D
# 4:     3  2013   <NA>
# 5:     4  2020      C
# 6:     5  2009      A
# 7:     6  2010      B
# 8:     6  2012   <NA>

method 2

df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]

method 3

(likely slower because of [ overhead)

df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]

score 1 · Answer 2 · answered Nov 28 '21 at 12:39

I would propose this solution in which you exclude the unwanted rows prior to unique. If all observations for a group are NA, sum(is.na(x)) / .N is equal to 1 and we proceed from there

library(tibble)
library(data.table)

df_id_and_type <-
  tibble::tribble(
    ~id, ~type,
    1,   "A",
    1,    NA,
    2,   "B",
    3,   "A",
    3,    NA,
    3,   "D",
    3,    NA,
    4,    NA,
    4,   "C",
    5,   "A",
    6,    NA,
    6,   "B",
    6,    NA
  )


df_id_year_and_type <-
  df_id_and_type %>%
  add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012), 
             .before = "type")

# convert to data.table
dt_id_year_and_type <- as.data.table(df_id_year_and_type)

# define grouping vars
grouping_vars <- c("id", "year")

# are all types na for a group?
dt_id_year_and_type[, na_ratio := sum(is.na(type)) / .N, 
                    by = c(grouping_vars)]

# remove all lines that are NA, except they are from a group in which all 
# observations are NA
dt_id_year_and_type <- dt_id_year_and_type[na_ratio == 1 | !is.na(type)]

# sort correctly
setorderv(dt_id_year_and_type, grouping_vars) 
dt_id_year_and_type
#>    id year type  na_ratio
#> 1:  1 2002    A 0.5000000
#> 2:  2 2008    B 0.0000000
#> 3:  3 2010    A 0.3333333
#> 4:  3 2010    D 0.3333333
#> 5:  3 2013 <NA> 1.0000000
#> 6:  4 2020    C 0.5000000
#> 7:  5 2009    A 0.0000000
#> 8:  6 2010    B 0.5000000
#> 9:  6 2012 <NA> 1.0000000

# keep only the last observation of each group
dt_unique <- unique(dt_id_year_and_type, by = grouping_vars, fromLast = TRUE)

remove no longer needed helper column
dt_unique[, na_ratio := NULL]
dt_unique
#>    id year type
#> 1:  1 2002    A
#> 2:  2 2008    B
#> 3:  3 2010    D
#> 4:  3 2013 <NA>
#> 5:  4 2020    C
#> 6:  5 2009    A
#> 7:  6 2010    B
#> 8:  6 2012 <NA>

PaulS · Answer 3 · 2021-11-28T14:30:34.017

Another possible solution:

library(tidyverse) 

df_id_year_and_type %>% 
  group_by(id, year) %>% 
  fill(type, .direction = "downup") %>% 
  summarise(type = last(type), .groups = "drop")

#> # A tibble: 8 × 3
#>      id  year type 
#>   <dbl> <dbl> <chr>
#> 1     1  2002 A    
#> 2     2  2008 B    
#> 3     3  2010 D    
#> 4     3  2013 <NA> 
#> 5     4  2020 C    
#> 6     5  2009 A    
#> 7     6  2010 B    
#> 8     6  2012 <NA>

Jim · Answer 4 · 2021-11-28T18:50:48.630

library(dplyr)

A simple, easy to read example of the basic case is

df_id_and_type %>% filter(!is.na(type)) %>%  
                   filter(id != lead(id) | id == max(id))

extending to the second criteria

df_id_year_and_type %>% filter(!is.na(type)) %>%  
                        filter((id != lead(id) | id == max(id)) & 
                                (year != lead(year) | year == max(year)))

It is clear and easy to understand. If you wish to retain the distinct groupings with no result you can either merge distinct back or insert another OR clause in the filters

score 0 · Answer 5 · answered Nov 29 '21 at 10:59

Why not using a simple max ?

setDT(df_id_year_and_type)
df_id_year_and_type[,max(type, na.rm=T), by=.(id, year)]

You will get a warning when there is only NA and the option na.rm is TRUE, but you can suppress it easily:

df_id_year_and_type[,suppressWarnings(max(type, na.rm=T)), by=.(id, year)]

Or alternatively, test if all values are NA:

df_id_year_and_type[,ifelse(all(is.na(type)), NA_character_, max(type, na.rm=T)), by=.(id, year)]

Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable

5 Answers5

method 1

method 2

method 3