This is a slight variation of a question that was previously answered on Stack Overflow (Unique on a dataframe with only selected columns).
The only difference between that question and mine is that I need to specify which of the duplicated rows should be retained. My rows have names, so I am thinking of using a substring to decide which of the duplicates to keep and which to delete, but I am unable to put this into code. For example, if the duplicate rows are exm123 and tre123, I want to retain the one containing the substring tre.
If you think there are easier ways to do this in R without using a substring, I am more than happy to learn the alternative. Thanks.
dat:
Index Name id1 id2
1 exm-9980 1 202183358
2 exm-53487 1 203186865
3 exm-tre10248 1 85537661
4 exm-7747 10 102827758
5 exm-29639 10 18289634
6 exm-76467 10 27436462
7 exm-tre7540 10 18289634
8 exm-4560589 10 74890584
9 vg-194357 11 102589148
10 exm-0867390 11 61110815
11 exm-IN3127 1 85537661
12 exm-tre2315 11 18632984
13 exm-12411 6 30332555
14 exm-128711 11 18632984
nm1 <- c('id1', 'id2')
indx <- duplicated(dat[, nm1]) | duplicated(dat[, nm1], fromLast = TRUE)
df22 <- dat[!indx | (indx & grepl("^tre", dat$Name)), ]
which(indx)
indx: 3, 5, 7, 12, 14, 11, 13
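The filtering step above can be sketched end to end on the sample table. This is only a minimal sketch, assuming dat is a data frame with exactly the columns shown. One change from the code above is an assumption about intent: the pattern is "tre" rather than "^tre", because the ^ anchor only matches names that start with tre, and the sample names all start with exm- or vg-.

```r
# Sample data copied from the table in the question.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Index Name id1 id2
1 exm-9980 1 202183358
2 exm-53487 1 203186865
3 exm-tre10248 1 85537661
4 exm-7747 10 102827758
5 exm-29639 10 18289634
6 exm-76467 10 27436462
7 exm-tre7540 10 18289634
8 exm-4560589 10 74890584
9 vg-194357 11 102589148
10 exm-0867390 11 61110815
11 exm-IN3127 1 85537661
12 exm-tre2315 11 18632984
13 exm-12411 6 30332555
14 exm-128711 11 18632984
")

nm1 <- c("id1", "id2")

# TRUE for every row whose (id1, id2) pair occurs more than once.
indx <- duplicated(dat[, nm1]) | duplicated(dat[, nm1], fromLast = TRUE)

# Keep rows that are not duplicated, plus duplicated rows whose Name
# contains "tre" anywhere (assumption: the substring need not be a prefix).
keep <- !indx | grepl("tre", dat$Name, fixed = TRUE)
df22 <- dat[keep, ]

print(which(indx))
print(df22$Index)
```

On the sample table this flags rows 3, 5, 7, 11, 12 and 14 and drops rows 5, 11 and 14. Note that in this sketch a duplicated group with no tre name at all would be dropped entirely, which may or may not be what is wanted.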
When I cross-check index 13 against the main data using its id1 and id2 values:
f1 <- dat[dat$id1 == 6 & dat$id2 == 30332555, ]
f1 has only 1 row. If row 13 were a duplicate, f1 should have 2 or more rows.
I am unable to upload the full data as it has more than 100k rows, but I hope this shows the problem clearly.
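The manual cross-check for index 13 can be done for all rows at once by counting how many rows share each (id1, id2) pair. The following is a minimal sketch on three rows taken from the sample table; the names grp_size and check are made up for illustration.

```r
# Three rows from the sample table: rows 3 and 11 share an (id1, id2) pair,
# row 13 has a pair that occurs only once.
dat <- data.frame(
  Index = c(3L, 11L, 13L),
  Name  = c("exm-tre10248", "exm-IN3127", "exm-12411"),
  id1   = c(1L, 1L, 6L),
  id2   = c(85537661L, 85537661L, 30332555L),
  stringsAsFactors = FALSE
)

nm1 <- c("id1", "id2")
indx <- duplicated(dat[, nm1]) | duplicated(dat[, nm1], fromLast = TRUE)

# Number of rows sharing each row's (id1, id2) pair.
grp_size <- ave(seq_len(nrow(dat)), dat$id1, dat$id2, FUN = length)

# Side-by-side view: a row should be flagged exactly when its group size > 1.
check <- data.frame(dat, flagged = indx, grp_size = grp_size)
print(check)

# Positions where the flag and the group size disagree (ideally none).
print(which(indx != (grp_size > 1)))
```

If a row shows up in the last line on the full data, one possibility worth ruling out is that indx was computed on a different version of dat (for example before sorting or subsetting) than the one being checked.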