I want to keep distinct rows on a data frame, with an algorithm that chooses the last value per group (as dplyr::distinct()
does by default), but only if it's not NA
. I've seen this great answer on SO that relies on data.table
, but I can't scale it to data with more than one grouping variable.
To demonstrate the problem, I start with the minimal example that does work, then scale it up. So first, consider the following data:
library(tibble)
df_id_and_type <-
tibble::tribble(
~id, ~type,
1, "A",
1, NA,
2, "B",
3, "A",
3, NA,
3, "D",
3, NA,
4, NA,
4, "C",
5, "A",
6, NA,
6, "B",
6, NA
)
I want to get the distinct type
values per id
, by choosing the last value unless it's NA
. If the last is NA
then go up until there's non-NA
. So this answer shows us how to do it with data.table
:
library(data.table)
dt_id_and_type <- as.data.table(df_id_and_type)
dt_id_and_type$typena <- is.na(dt_id_and_type$type)
setorderv(dt_id_and_type, c("typena","id"), order = c(-1, 1))
dt_id_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE]
#> id type
#> 1: 1 A
#> 2: 2 B
#> 3: 3 D
#> 4: 4 C
#> 5: 5 A
#> 6: 6 B
But what to do if we have more than one grouping variable (i.e, not only id
)? In the following example I add a year
variable:
df_id_year_and_type <-
df_id_and_type %>%
add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012),
.before = "type")
df_id_year_and_type
#> # A tibble: 13 x 3
#> id year type
#> <dbl> <dbl> <chr>
#> 1 1 2002 A
#> 2 1 2002 <NA>
#> 3 2 2008 B
#> 4 3 2010 A
#> 5 3 2010 <NA>
#> 6 3 2010 D
#> 7 3 2013 <NA>
#> 8 4 2020 <NA>
#> 9 4 2020 C
#> 10 5 2009 A
#> 11 6 2010 <NA>
#> 12 6 2010 B
#> 13 6 2012 <NA>
My expected output would be:
## # A tibble: 8 x 3
## id year type
## <dbl> <dbl> <chr>
## 1 1 2002 A
## 2 2 2008 B
## 3 3 2010 D
## 4 3 2013 NA # for id 3 in year 2013 there was only `NA`, so that's what we get
## 5 4 2020 C
## 6 5 2009 A
## 7 6 2010 B
## 8 6 2012 NA # same as comment above
Any idea how I could scale the solution that worked in 1-grouping-var case to the current data? The first 2 lines of code are no-brainer:
dt_id_year_and_type <- as.data.table(df_id_year_and_type)
dt_id_year_and_type$typena <- is.na(dt_id_year_and_type$type)
setorderv(dt_id_year_and_type, c("typena","id"), order = c(-1, 1)) # <--- how to account for `year`?
dt_id_year_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE] # <--- here too...