R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

Question

The real life problem: I have subjects with MRI scan data. Some of which have been scanned multiple times (separate rows). Some of those were scanned under different protocols each time. I want to keep all unique rows by subject ID, and if a subject was scanned under two different protocols, I want it to prefer one over the other.

The toy example:

library(dplyr)  
df <- tibble(
        id = c("A", "A", "B", "C", "C", "D"), 
        protocol = c("X", "Y", "X", "X", "X", "Y"),
        date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"), 
                 by="days")),
        var = 1:6)

I want to return a data frame with all unique subjects by id. When it comes to a duplicate value, instead of automatically keeping the first entry, I want it to keep the entry with "Y" as the protocol if it has that choice, but not to get rid of rows with "X" otherwise.

In the example, it would keep rows 2, 3, 4, and 6.

I prefer dplyr, but am open to other suggestions.

Nothing that I've tried even begins to work:

df %>% distinct(id, .keep_all = TRUE) #Nope! 

df %>% distinct(id, protocol == "Y", .keep_all = TRUE) #Nope!  

df$protocol <- factor(df$protocol, levels = c("Y", "X"))
df %>% distinct(id, .keep_all = TRUE) #Nope!  

df %>% group_by(id) %>% filter(protocol == "Y") #Nope!

Two good answers: @RobJensen suggests

df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)

If I have multiple protocols and wish to assign an order to which they will be chosen, I can create a new variable where I assign the protocols an integer in order of preference, then use the suggestion from @joran

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% slice(1)

Thanks!

+1 for posting a clearly described question and desired result, with a small example and the code you have tried, in your first SO post! Cheers. — Henrik, Jan 22 '18 at 21:21

IceCreamToucan · Accepted Answer · 2018-01-22T21:51:16.847

Arranging alphabetically works in the stated simple case, but if you want you can add a protocol_preference variable to give an ordering of what you'd prefer to be selected if Y isn't available, and to select "Y" even if it doesn't happen to be the last protocol value when sorted alphabetically.

Building off @davechilders answer and @Nathan Werth 's idea of creating a factor based on an "order of importance" vector

order_of_importance <- c("Y", "Z", "X")

    df2 %>%
      mutate(protocol = factor(protocol, order_of_importance)) %>%
      arrange(id, protocol) %>%
      distinct(id, .keep_all = TRUE)

Or if you just want to select 'Y' and don't have a preference for what's selected if 'Y' isn't avaialable you can do

df %>% 
    arrange(id, desc(protocol == 'Y')) %>% 
    distinct(id, .keep_all = TRUE)

score 6 · Answer 2 · answered Jan 22 '18 at 20:33

6

You can achieve this without using group_by() if you want the output to be a tibble that is not a grouped_df.

df %>% arrange(id, desc(protocol)) %>% distinct(id, .keep_all = TRUE)

answered Jan 22 '18 at 20:33

davechilders

8,693
2
18
18

1

In this case sorting the protocol variables alphabetically is conveniently the right ordering, but in the more general case e.g. if there was a "Z" protocol but the "Y" still needs to be selected you can do `df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)` – IceCreamToucan Jan 22 '18 at 20:39
@RobJensen That works and is the simplest answer (so far). If you post it as an answer, I can choose it. Thanks! – Bart Jan 22 '18 at 20:47

joran · Answer 3 · 2018-01-22T20:46:12.220

4

There's probably a faster way (almost certainly with data.table) but this would be the naive direct approach in dplyr I think:

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% do(head(.,1))

As @Gregor noted below (now deleted), slice(1) is probably a better idiom for do(head(.,1)).

edited Jan 22 '18 at 20:46

answered Jan 22 '18 at 20:29

joran

169,992
32
429
468

That worked when I took out desc(). My real data has multiple different protocols with different names. What I can do is create another variable where I assign each protocol a number based on preference, then sort by that and slice the first one. Thanks! – Bart Jan 22 '18 at 20:42

score 1 · Answer 4 · answered Jan 22 '18 at 20:46

You could break the process into two steps: grab the must-haves, grab whatever for the other IDs, and combine.

distinct_y <- df %>%
  filter(protocol == "Y") %>%
  distinct(id, .keep_all = TRUE)

distinct_other <- df %>%
  anti_join(distinct_y, "id") %>%
  distinct(id, .keep_all = TRUE)

distinct_combined <- rbind(distinct_y, distinct_other)

If you'd like to generalize it from a "one above all" to an ordering of value, I suggest making protocol a factor.

For example, suppose there are three protocols: X, Y, and Z. Y is the best, Z is better than X, and you only want X if there's nothing better.

# Only difference is the best protocol for C will now be Z.
df2 <- tibble(
  id = c("A", "A", "B", "C", "C", "D"),
  protocol = c("X", "Y", "X", "X", "Z", "Y"),
  date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"),
               by="days")),
  var = 1:6
)

order_of_importance <- c("Y", "Z", "X")

df2 %>%
  mutate(protocol = factor(protocol, order_of_importance)) %>%
  group_by(id) %>%
  arrange(protocol) %>%
  slice(1)
# # A tibble: 4 x 4
# # Groups: id [4]
#   id    protocol date         var
#   <chr> <fctr>   <date>     <int>
# 1 A     Y        2018-01-02     2
# 2 B     X        2018-01-03     3
# 3 C     Z        2018-01-05     5
# 4 D     Y        2018-01-06     6

R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?

4 Answers4

Linked