Pick specific rows from duplicates on a data frame using only selected columns

Question

This is a slight variation of a question that was answered previously on SO (Unique on a dataframe with only selected columns).

The only difference between that question and mine is that I have to specify which rows from each set of duplicates should be retained. My rows are identified by names, so I am thinking of supplying a substring and deleting the duplicate rows whose names contain that substring, but I am unable to put this into code. For example, if the duplicate rows are exm123 and tre123, I want to retain the one with the tre substring.

If you think there are easier ways to do this in R without using a substring, I am more than happy to learn the alternative. Thanks.

  dat:    
  Index Name      id1   id2
  1 exm-9980        1   202183358
  2 exm-53487       1   203186865
  3 exm-tre10248    1   85537661
  4 exm-7747       10   102827758
  5 exm-29639      10   18289634
  6 exm-76467      10   27436462
  7 exm-tre7540    10   18289634
  8 exm-4560589    10   74890584
  9 vg-194357      11   102589148
  10 exm-0867390   11   61110815
  11 exm-IN3127     1   85537661
  12 exm-tre2315   11   18632984
  13 exm-12411      6   30332555
  14 exm-128711    11   18632984

nm1 <- c('id1', 'id2')           
indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1],fromLast=TRUE)    
df22=dat[!indx|(indx & grepl("^tre", dat$Name)),]    
which(indx==T)       

indx: 3,5,7,12,14,11,13

When I cross-check index 13 using its id1 and id2 values from the main data:
f1=dat[dat$id1==6& dat$id2==30332555,]
f1 has only 1 row. If row 13 were a duplicate, f1 should have 2 or more rows.
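
The same cross-check can be run for every row at once by counting how many rows share each (id1, id2) pair. The following is a minimal sketch, assuming the 14-row sample shown above; the names `key` and `grp_size` are introduced here for illustration only and are not part of the original code.

```r
# Sample data from the question, rebuilt so the block runs on its own.
dat <- data.frame(
  Index = 1:14,
  Name = c("exm-9980", "exm-53487", "exm-tre10248", "exm-7747", "exm-29639",
           "exm-76467", "exm-tre7540", "exm-4560589", "vg-194357",
           "exm-0867390", "exm-IN3127", "exm-tre2315", "exm-12411",
           "exm-128711"),
  id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 1L, 11L, 6L, 11L),
  id2 = c(202183358L, 203186865L, 85537661L, 102827758L, 18289634L,
          27436462L, 18289634L, 74890584L, 102589148L, 61110815L,
          85537661L, 18632984L, 30332555L, 18632984L),
  stringsAsFactors = FALSE
)

nm1 <- c("id1", "id2")
indx <- duplicated(dat[, nm1]) | duplicated(dat[, nm1], fromLast = TRUE)

# Hypothetical helper columns: one key per (id1, id2) pair, and the number
# of rows that share that key.
key <- paste(dat$id1, dat$id2, sep = "_")
grp_size <- ave(seq_along(key), key, FUN = length)

which(indx)          # 3 5 7 11 12 14
which(grp_size > 1)  # 3 5 7 11 12 14
indx[13]             # FALSE
```

On this sample the two index sets agree and row 13 is not flagged, so any row that is flagged without having a partner would have to come from something present only in the full data.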

I am unable to post the full data as it has more than 100k rows, but I hope this shows the problem clearly.

Please provide a small sample of your data and show what you expect as a result. – Rich Scriven, Oct 15 '14 at 03:19
@user2698508 Suppose there are duplicates and both of them start with `tre` (as shown in my example). Then, which one would you retain? – akrun, Oct 15 '14 at 03:54
@user2698508 I get the expected result from your new example by using the previous code. – akrun, Oct 15 '14 at 07:10
@user2698508 In the new dataset you have `exm` and `tre` in a single `Name` value. Here, which row should be in the expected output? – akrun, Oct 15 '14 at 09:53
@user2698508 Also, regarding `grepl("^rs", dat$Name)`: I couldn't find `rs` as the starting characters in the Name column. – akrun, Oct 15 '14 at 09:58
@Akrun, Edited the code. It should be grepl("^tre",dat$Name). Yes, as seen between index 12 and 14, I should retain exm-tre. – user2698508, Oct 16 '14 at 02:14
@user2698508 Please check the update. I guess the problem was because in the new example, the `tre` is in the middle of the string in `Name` column. – akrun, Oct 16 '14 at 03:40
@Akrun, Works perfectly. Thanks a lot for the great help. I do not have enough credits to upvote but I have accepted the answer using the tick. – user2698508, Oct 16 '14 at 08:05

akrun · Accepted Answer · 2014-10-16T03:36:24.403


Using the example dataset:

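 nm1 <- c('id1', 'id2')
 # TRUE for every row whose (id1, id2) pair occurs more than once
 indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1], fromLast=TRUE)

 # keep all non-duplicated rows, plus duplicated rows whose Name starts with "tre"
 dat[!indx|(indx & grepl("^tre", dat$Name)),]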
 #   Index      Name id1    id2
 #1     1 exm-49980   1   2021
 #2     2  exm-3487   1  20318
 #3     3  exm-0248   1   8553
 #4     4 exm-17747  10 102827
 #5     5 exm-29639  10  18289
 #7     7  tre-2987  10  27436
 #8     8  vg-18999  18 279990

data

 dat <- structure(list(Index = 1:8, Name = c("exm-49980", "exm-3487", 
 "exm-0248", "exm-17747", "exm-29639", "exm-6467", "tre-2987", 
 "vg-18999"), id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 18L), id2 = c(2021L, 
 20318L, 8553L, 102827L, 18289L, 27436L, 27436L, 279990L)), .Names = c("Index", 
 "Name", "id1", "id2"), class = "data.frame", row.names = c(NA, 
-8L))
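
To see what each piece of the subsetting line contributes, the index can be built step by step. This is only an illustrative sketch on the example data above; the intermediate names `fwd`, `bwd`, `starts_tre` and `keep` are introduced here and do not appear in the original code.

```r
# First example data set from this answer, rebuilt so the block runs on its own.
dat <- data.frame(
  Index = 1:8,
  Name = c("exm-49980", "exm-3487", "exm-0248", "exm-17747", "exm-29639",
           "exm-6467", "tre-2987", "vg-18999"),
  id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 18L),
  id2 = c(2021L, 20318L, 8553L, 102827L, 18289L, 27436L, 27436L, 279990L),
  stringsAsFactors = FALSE
)
nm1 <- c("id1", "id2")

fwd <- duplicated(dat[, nm1])                   # flags only the later copies
bwd <- duplicated(dat[, nm1], fromLast = TRUE)  # flags only the earlier copies
indx <- fwd | bwd                               # element-wise OR: all members of a duplicated group

which(fwd)   # 7
which(bwd)   # 6
which(indx)  # 6 7

# "^" anchors the regular expression at the start of the string.
starts_tre <- grepl("^tre", dat$Name)

# Keep a row if it is not duplicated, or if it is duplicated and starts with "tre".
keep <- !indx | (indx & starts_tre)
dat$Index[keep]  # 1 2 3 4 5 7 8
```

Using `duplicated()` in one direction only would leave one member of each duplicated pair unflagged, which is why the two calls are combined with `|`.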

Update

 nm1 <- c('id1', 'id2')
 indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1], fromLast=TRUE)

 # Check the `grepl`. The pattern is changed as per the new example.
 # Here, the `Name` no longer starts with `tre`.
 dat1 <- dat[!indx|(indx&grepl("-tre", dat$Name)),]
 dat1
 #    Index         Name id1       id2
 #1      1     exm-9980   1 202183358
 #2      2    exm-53487   1 203186865
 #3      3 exm-tre10248   1  85537661
 #4      4     exm-7747  10 102827758
 #6      6    exm-76467  10  27436462
 #7      7  exm-tre7540  10  18289634
 #8      8  exm-4560589  10  74890584
 #9      9    vg-194357  11 102589148
 #10    10  exm-0867390  11  61110815
 #12    12  exm-tre2315  11  18632984
 #13    13    exm-12411   6  30332555

data

 dat <- structure(list(Index = 1:14, Name = c("exm-9980", "exm-53487", 
 "exm-tre10248", "exm-7747", "exm-29639", "exm-76467", "exm-tre7540", 
 "exm-4560589", "vg-194357", "exm-0867390", "exm-IN3127", "exm-tre2315", 
 "exm-12411", "exm-128711"), id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 
 10L, 10L, 11L, 11L, 1L, 11L, 6L, 11L), id2 = c(202183358L, 203186865L, 
 85537661L, 102827758L, 18289634L, 27436462L, 18289634L, 74890584L, 
 102589148L, 61110815L, 85537661L, 18632984L, 30332555L, 18632984L
 )), .Names = c("Index", "Name", "id1", "id2"), class = "data.frame", row.names = c(NA, 
 -14L))
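
The difference between the two patterns can be checked directly on a few names. A small sketch, assuming a hand-picked vector `nms` (a hypothetical name) that mixes names from both example data sets:

```r
# Names taken from the two example data sets in this answer.
nms <- c("exm-9980", "exm-tre10248", "exm-tre7540", "tre-2987", "exm-IN3127")

# "^tre" matches only when the name starts with "tre".
grepl("^tre", nms)               # FALSE FALSE FALSE  TRUE FALSE

# "-tre" matches when "tre" directly follows a hyphen, anywhere in the name.
grepl("-tre", nms)               # FALSE  TRUE  TRUE FALSE FALSE

# A plain substring test matches "tre" at any position.
grepl("tre", nms, fixed = TRUE)  # FALSE  TRUE  TRUE  TRUE FALSE
```

Which of these is appropriate depends on how the names in the full data are built, so it may be worth running the chosen pattern on a sample of `dat$Name` before subsetting.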

edited Oct 16 '14 at 03:36

answered Oct 15 '14 at 03:48

akrun


Hi Akrun, in such cases I can retain either one, but I have not yet encountered duplicates that share the same prefix. I am unable to understand the dat[!indx |(indx & grepl("^tre", row.names(dat))),] line of your code, and I see that the result still has exm. Can you tell me what | and ^ are used for in this line? It will help me modify the code accordingly. I am unable to find them in the syntax of grepl. – user2698508 Oct 15 '14 at 06:47
@user2698508 I was just creating a dataset based on what you mentioned. Also, I looked at the link you showed. So, basically, what I understood is that you wanted to subset all non-duplicated rows and, if there is any duplicated row, take the one with `rownames` that start with `tre`. Here, in this case, the `exm145` row is non-duplicated. It might be better if you show an example dataset and the expected result so that there will be less confusion. – akrun Oct 15 '14 at 06:50
Yeah, you got it perfectly. :) I should subset the non-duplicated rows and, for the duplicated rows, retain the ones which do not have the exm prefix in their row name. E.g. exm123, tre123, exm145, tre133, vg189 are my row names; exm123, tre123, vg189 are non-duplicates; exm145, tre133 are duplicates, so I retain tre133. I hope I am able to say it more clearly with an example. – user2698508 Oct 15 '14 at 06:56
I tried the updated code. It works for some indices and does not work for others. I found that some values (which are the row numbers) in indx are not duplicates in my data. There is just one row with that id1 and id2 combination for those indx values. – user2698508 Oct 15 '14 at 07:37
@user2698508 It's not clear from your description. In the example you provided, there are cases with only one row for an id1 and id2 combination, i.e. rows 1:5 and 8. – akrun Oct 15 '14 at 08:23
Yeah. So with the actual data, the problem I face (which I will try to explain with the above sample data) is that the indx variable takes in index 5 also, although it is not a duplicate. indx should have only value 6 according to this example, but it shows values 5 and 6. – user2698508 Oct 15 '14 at 08:56
@user2698508 Not clear, though. Could you change that in the expected output? It is all confusing now. – akrun Oct 15 '14 at 08:58
Sorry for the confusion. The expected output that I have shown in the question is what I am looking for. The indx in the updated code should essentially have the index of the rows which are duplicates, so according to the example, indx should have the values 5,6. In the next step, index 6 is not included because it does not have the "tre" prefix. Have I understood that properly? But in reality, when I run the code, indx takes values 3,5,6. 3 should not make it into indx because it does not have a duplicate, and yet the row at index 3 is removed as it does not have the prefix "tre". – user2698508 Oct 15 '14 at 09:11
@user2698508 I guess you are mentioning that in the current example, the code works, but not in other cases. If that is the problem, please post that example which doesn't work. Reading the description can be a bit confusing sometimes. – akrun Oct 15 '14 at 09:14
OK, I will update the question with my code as well. – user2698508 Oct 15 '14 at 09:16
