Selecting unique values from one column in a dataframe while keeping a random corresponding value from another?

Question

This has been racking my brain! I have a dataframe in which one column contains duplicate values. A second column holds the corresponding values, and any one of them is fine to keep when selecting the unique values in the first column. For example:

 data <- data.frame(a = c(2,4,4,6,3,6,4,3,3,2,2), b = c("a", "b", "c", "a", "f", "e", "p", "e", "u", "c", "f"))

If I do something like:

 res <- unique(data[c("a", "b")])

This returns the unique combinations of a and b, so values in column a are still repeated. The result has to contain unique values in column a, while b can be any one of the values that correspond to each unique value of a. The result should look something like this:

res <- data.frame(a = c(2,4,6,3), b = c("a", "b", "a", "f"))

Any help would be appreciated!

akrun · Accepted Answer · 2020-04-10T21:14:32.787

We can use sample_n() after grouping by 'a' to pick one random row per group:

library(dplyr)
data %>% 
   group_by(a) %>%
   sample_n(1)
# A tibble: 4 x 2
# Groups:   a [4]
#      a b    
#  <dbl> <fct>
#1     2 a    
#2     3 e    
#3     4 c    
#4     6 a
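
Because the draw is random, the rows picked will differ from run to run. The following is a minimal sketch, assuming dplyr 1.0.0 or later, where slice_sample() supersedes sample_n(). The seed value is arbitrary and is only there to make the draw repeatable; it is not part of the original answer.

```r
library(dplyr)

data <- data.frame(
  a = c(2, 4, 4, 6, 3, 6, 4, 3, 3, 2, 2),
  b = c("a", "b", "c", "a", "f", "e", "p", "e", "u", "c", "f")
)

set.seed(42)  # arbitrary seed (assumption), only to make the draw repeatable

res <- data %>%
  group_by(a) %>%
  slice_sample(n = 1) %>%  # one random row per value of 'a'
  ungroup()                # drop the grouping so later steps see a plain tibble
```

Calling ungroup() at the end is optional; it just avoids carrying the grouping into later operations.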

Or use slice() with sample() on row_number():

data %>%
  group_by(a) %>%
  slice(sample(row_number(), 1))

Or, if we need to keep only the 'b' column (for example, when there are many other columns), use summarise():

data %>%
  group_by(a) %>%
  summarise(b = sample(b, 1))
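
If any corresponding value is acceptable and randomness is not actually required, base R may be enough. The sketch below is an alternative under that assumption, not part of the original answer: duplicated() flags repeated values of 'a', so negating it keeps the first row seen for each value. Shuffling the rows beforehand turns the same idea into a random pick. The names first_rows, shuffled and random_rows are illustrative.

```r
data <- data.frame(
  a = c(2, 4, 4, 6, 3, 6, 4, 3, 3, 2, 2),
  b = c("a", "b", "c", "a", "f", "e", "p", "e", "u", "c", "f")
)

# Deterministic: keep the first row encountered for each value of 'a'
first_rows <- data[!duplicated(data$a), ]

# Random: shuffle the rows, then keep the first row per value of 'a'
shuffled <- data[sample(nrow(data)), ]
random_rows <- shuffled[!duplicated(shuffled$a), ]
```

On the example data, first_rows has a = 2, 4, 6, 3 and b = "a", "b", "a", "f", which matches the result shown in the question.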
