Remove duplicates based on a specific criteria

Question

I have a dataset with programme participants, where I would like to remove clients who participated in two (or more) different programmes. I tried to work around this task by selecting clients with distinct IDs, and clients who participated more than once in the same programme, and then hoping to join these two subsets, but I keep failing. Here is my attempt on a small sample:

Client.ID = c(1, 2, 2, 3, 3)
Program = c("P1", "P2", "P2", "P1", "P2")
From = as.Date(c("2016-10-01", "2016-01-25", "2016-07-01", "2016-04-06", "2016-05-01")) 
To = as.Date(c("2017-03-31", "2016-06-30", "2016-11-27", "2016-06-03", "2016-06-02")) 
df = tibble(Client.ID, Program, From, To)

df
# A tibble: 5 x 4
  Client.ID Program       From         To
      <dbl>   <chr>     <date>     <date>
1         1      P1 2016-10-01 2017-03-31
2         2      P2 2016-01-25 2016-06-30
3         2      P2 2016-07-01 2016-11-27
4         3      P1 2016-04-06 2016-06-03
5         3      P2 2016-05-01 2016-06-02


distinct(df, Client.ID, .keep_all = TRUE) #this selects distinct cases nicely
df[duplicated(df$Client.ID) & !duplicated(df$Program),] #but this does not work

I tried to look at other questions on the forum, but I only found ones, eg., here and here, that deal with duplicates based on two variables. I want to do something else: to remove cases that are duplicates in the first variable, but have different values in the second variable. Clients who participated in the same programme multiple times, should stay in the dataset.

Many thanks for any help!

score 1 · Accepted Answer · answered Nov 18 '17 at 17:18

1

with library(dplyr):

df %>% group_by(Client.ID) %>%
    filter(length(unique(Program)) < 2 )

answered Nov 18 '17 at 17:18

Nate

10,361
3
33
40

It took me a while to understand it, but I got it! Thank you so much. – malasi Nov 18 '17 at 20:46

score 0 · Answer 2 · answered Nov 18 '17 at 19:03

0

Here's a base R version:

df[!duplicated(unique(df[, 1:2])[, 1]), ]

In words, find the unique pairs and remove duplicates of "Client.ID".

answered Nov 18 '17 at 19:03

Nate A

71
8

This one does not seem to work because it retains client 3 in the data frame. – malasi Nov 18 '17 at 20:47

Remove duplicates based on a specific criteria

2 Answers2