Randomly remove duplicated rows using dplyr()

Question

As a follow-up question to this one: Remove duplicated rows using dplyr, I have the following:

How can I randomly choose which of the duplicated rows to keep, using dplyr (or another package)?

My command now is:

data.uniques <- distinct(data, KEYVARIABLE, .keep_all = TRUE)

But it always returns the first occurrence of each KEYVARIABLE. I want the kept row to be chosen at random instead: any one of the 1 to n occurrences of that KEYVARIABLE.

For instance:

KEYVARIABLE BMI
1 24.2
2 25.3
2 23.2
3 18.9
4 19
4 20.1
5 23.0

Currently my command returns:

KEYVARIABLE BMI
1 24.2
2 25.3
3 18.9
4 19
5 23.0

I want it to randomly return one of the n duplicated rows, for instance:

KEYVARIABLE BMI
1 24.2
2 23.2
3 18.9
4 19
5 23.0
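
To make this reproducible, the example can be rebuilt as a data frame. This is only a sketch: the name `df1` is an assumption (chosen to match the answers) and the values are copied from the table above. It shows that `distinct()` keeps the first row for each key.

```r
library(dplyr)

# Example data copied from the question; the name `df1` is an assumption
df1 <- data.frame(
  KEYVARIABLE = c(1, 2, 2, 3, 4, 4, 5),
  BMI = c(24.2, 25.3, 23.2, 18.9, 19, 20.1, 23.0)
)

# distinct() keeps the first row it encounters for each KEYVARIABLE
first_rows <- distinct(df1, KEYVARIABLE, .keep_all = TRUE)
first_rows
```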

score 8 · Accepted Answer · answered Aug 21 '17 at 20:08

One option is to group by 'KEYVARIABLE', sample one row index within each group, and subset the dataset with it:

library(data.table)
setDT(df1)[, .SD[sample(.N)[1]], KEYVARIABLE]

Or using dplyr

library(dplyr)
df1 %>% 
   group_by(KEYVARIABLE) %>%
   sample_n(1)
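
In dplyr 1.0.0 and later, `sample_n()` is superseded by `slice_sample()`, so the same idea could be written as below. This is a sketch, not part of the original answer: the data frame is rebuilt from the question's example, and `set.seed()` is only there to make the draw repeatable.

```r
library(dplyr)

# Example data from the question (assumed to be stored as `df1`)
df1 <- data.frame(
  KEYVARIABLE = c(1, 2, 2, 3, 4, 4, 5),
  BMI = c(24.2, 25.3, 23.2, 18.9, 19, 20.1, 23.0)
)

set.seed(42)  # optional: only so the random draw can be repeated

random_rows <- df1 %>%
  group_by(KEYVARIABLE) %>%
  slice_sample(n = 1) %>%  # one random row per group
  ungroup()                # drop the grouping so later steps are not per-group

random_rows
```

Each KEYVARIABLE appears exactly once in `random_rows`; for keys 2 and 4 the BMI shown depends on the seed.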

akrun

For big `.N`, drawing a single index is cheaper than shuffling all of them; compare `microbenchmark::microbenchmark(sample(1000, 1), sample(1000)[1])` – s_baldur Jun 12 '19 at 13:02

score 6 · Answer 2 · pogibas · 2017-08-21T20:12:06.790

Just shuffle the rows before selecting the first occurrence with distinct():

library(dplyr)
distinct(df[sample(1:nrow(df)), ], 
         KEYVARIABLE, 
         .keep_all = TRUE)
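
One side effect of shuffling is that the result comes back in shuffled order. A minimal sketch of the same approach, with a seed for repeatability and an `arrange()` step to restore key order, might look like this (the data frame is rebuilt from the question's example, and `set.seed()` and `arrange()` are additions, not part of the code above):

```r
library(dplyr)

# Example data from the question (assumed to be stored as `df`)
df <- data.frame(
  KEYVARIABLE = c(1, 2, 2, 3, 4, 4, 5),
  BMI = c(24.2, 25.3, 23.2, 18.9, 19, 20.1, 23.0)
)

set.seed(1)  # optional: only so the shuffle can be repeated

picked <- df[sample(nrow(df)), ] %>%             # shuffle all rows
  distinct(KEYVARIABLE, .keep_all = TRUE) %>%    # keep first row per key after the shuffle
  arrange(KEYVARIABLE)                           # restore key order

picked
```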

edited Aug 21 '17 at 20:12

answered Aug 21 '17 at 20:09

pogibas

Also `df1 %>% slice(sample(n())) %>% distinct(KEYVARIABLE, .keep_all = TRUE)` for those hooked on chains. – Frank Aug 21 '17 at 20:21

score 1 · Answer 3 · answered Aug 21 '17 at 20:13

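Using dplyr, give every row a random rank and keep the row with the smallest rank in each group:

library(dplyr)
df %>%
  mutate(A = sample(n())) %>%   # A is a random permutation of the row numbers
  group_by(KEYVARIABLE) %>%
  slice(which.min(A))           # row with the smallest random rank per group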

BENY
