Conditionally dropping duplicates from a data.frame

Question

I am trying to figure out how to subset my dataset according to repeated values of the variable s, while also taking into account the id associated with each row.

Suppose my dataset is:

dat <- read.table(text = "
        id     s          
        1      2     
        1      2     
        1      1      
        1      3     
        1      3     
        1      3     
        2      3     
        2      3     
        3      2     
        3      2", 
header=TRUE)

What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:

I have tried using both duplicated() and which() to build a condition for subset(), but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3's of one id run straight into those of the next. Which strategy would you adopt?

There are also duplicates in id=1 where s=2 and id=3 where s=2, do you want to keep these or remove them as well? — Manar Bushnaq, Jan 12 '13 at 01:27

score 2 · Accepted Answer · edited May 23 '17 at 12:07

Like this:

subset(dat, s != 3 | s == 3 & !duplicated(dat)) 
#    id s
# 1   1 2
# 2   1 2
# 3   1 1
# 4   1 3
# 7   2 3
# 9   3 2
# 10  3 2

Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:

dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
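
To make the logic of the condition easier to follow, here is a minimal sketch that splits it into named steps. The names is_repeat, keep and result are only illustrative, and restricting duplicated() to the id and s columns is an assumption on my part (the answer above passes the whole data.frame, which gives the same result here because dat has only those two columns).

```r
# Rebuild the example data from the question.
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2", header = TRUE)

# TRUE for every row that repeats an earlier (id, s) pair.
# Because id is part of the comparison, the first s == 3 row of
# id 2 is not treated as a duplicate of the s == 3 rows of id 1.
is_repeat <- duplicated(dat[, c("id", "s")])

# Keep every row where s is not 3, plus the first s == 3 row of each id.
keep <- dat$s != 3 | !is_repeat
result <- dat[keep, ]

rownames(result)
# [1] "1"  "2"  "3"  "4"  "7"  "9"  "10"
```

The s == 3 term from the original condition is dropped here because it is redundant: any row that fails s != 3 necessarily has s equal to 3 (assuming s contains no NA values).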

answered Jan 12 '13 at 01:37

flodel


Thanks a lot! Also for the link – Stefano Lombardi Jan 12 '13 at 01:41
Sorry, I have no idea, I'd have to look at your data. – flodel Jan 12 '13 at 02:36
