Conditionally removing duplicates

Question

I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.

Specifically, I need to delete any row where size = 0 only if SampleID is duplicated.

SampleID<-c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size<-c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data<-data.frame(SampleID, size)

I want to delete rows with:

Sample ID   size
a           0
d           0

And keep:

SampleID   size
a          1
b          1
b          2
b          3
c          0
d          1
e          0

Note. actual dataset it very large, so I am not looking for a way to just remove a known row by row number.

Please read: https://stackoverflow.com/help/how-to-ask and create reproducible example. — An economist, Sep 08 '17 at 14:58
To clarify the last comment: the code you provided does not run. Test examples in a new R session before posting. — Frank, Sep 08 '17 at 15:18

M-- · Answer 1 · 2020-05-28T18:38:57.020

3

In dplyr we can do this using group_by and filter:

library(dplyr)

data %>% 
  group_by(SampleID) %>% 
  filter(!(size==0 & n() > 1)) # filter(size!=0 | n() == 1))

#> # A tibble: 7 x 2
#> # Groups:   SampleID [5]
#>   SampleID  size
#>   <fct>    <dbl>
#> 1 a            1
#> 2 b            1
#> 3 b            2
#> 4 b            3
#> 5 c            0
#> 6 d            1
#> 7 e            0

edited May 28 '20 at 18:38

answered May 27 '20 at 14:35

M--

25,431
8
61
93

deca · Answer 2 · 2017-09-08T15:31:58.103

0

A solution that works in base R without data.table and is easy to follow through for R starters:

#Find all duplicates

data$dup1 <- duplicated(data$SampleID)
data$dup2 <- duplicated(data$SampleID, fromLast = TRUE)
data$dup <- ifelse(data$dup1 == TRUE | data$dup2 == TRUE, 1, 0)


#Subset to relevant

data$drop <- ifelse(data$dup == 1 & data$size == 0, 1, 0)
data2 <- subset(data, drop == 0)

edited Sep 08 '17 at 15:31

answered Sep 08 '17 at 15:14

deca

730
1
8
24

score 0 · Accepted Answer · answered Sep 08 '17 at 15:25

Using data.table framework: Transform your set to data.table

require(data.table)
setDT(data)

Build a list of id where we can delete lines:

dropable_ids = unique(data[size != 0, SampleID])

Finaly keep lines that are not in the dropable list or with non 0 value

data = data[!(SampleID %in% dropable_ids & size == 0), ]

Please note that not( a and b ) is equivalent to a or b but data.table framework doesn't handle well or.

Hope it helps

Conditionally removing duplicates

3 Answers3

Linked