I have been looking for an efficient way to count and remove duplicate rows in a data frame while keeping the index of their first occurrences. For example, given the data frame below, counting the rows with ddply from plyr:

library(plyr)
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))
ddply(df, names(df), nrow)

gives me

    x   y   V1
1  0.6 4.2  2
2  1.3 8.1  2
3  5.1 7.1  1
4  8.5 3.2  1
5  9.3 2.4  1
6 10.8 5.9  1

But I want to keep the original indices (that is, the row names) of the first occurrence of each duplicated row, like this:

    x   y   V1
1  9.3 2.4  1
2  5.1 7.1  1
3  0.6 4.2  2
5  8.5 3.2  1
6  1.3 8.1  2
8 10.8 5.9  1

Subsetting with "duplicated" keeps the original row names (here {1 2 3 5 6 8}) but doesn't count the number of occurrences. I tried writing my own functions, but none of them is efficient enough to handle big data. My data frame can have up to a couple of million rows (though usually only 5 to 10 columns).
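To make the point about "duplicated" concrete, here is a minimal base-R sketch using the example data frame from above. The name `first_rows` is only illustrative. It shows that the row names of the first occurrences survive the subsetting, but no count comes with them.

```r
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))

# duplicated() flags every row that repeats an earlier row,
# so negating it keeps only the first occurrence of each row
first_rows <- df[!duplicated(df), ]

# The original row names are preserved, but there is no count column
rownames(first_rows)
# "1" "2" "3" "5" "6" "8"
```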

M--
Ira
  • I'd guess you took your solution from the duplicated post. I wonder, though, why you haven't looked further down at the rest of the solutions. – David Arenburg Nov 26 '15 at 10:12
  • I did look at many solutions but couldn't find the "keeping first occurrences of the duplicates" part anywhere. Since I don't have the privilege of adding comments to older posts, and asking a 'question' in the 'answer' box would have been wrong, I had to create a new post. I asked this question after struggling with it for 6 days. – Ira Nov 26 '15 at 10:29
  • Oh I see. I didn't notice that you want your row names too. – David Arenburg Nov 26 '15 at 10:30

2 Answers

We can try data.table. We convert the 'data.frame' to a 'data.table' with setDT(df), group by the 'x' and 'y' columns, and get the number of rows in each group with .N.

library(data.table)
setDT(df)[, list(V1=.N), by = .(x,y)]
#      x   y V1
#1:  9.3 2.4  1
#2:  5.1 7.1  1
#3:  0.6 4.2  2
#4:  8.5 3.2  1
#5:  1.3 8.1  2
#6: 10.8 5.9  1

If we need the row ids,

setDT(df)[, list(V1= .N, rn=.I[1L]), by = .(x,y)]
#      x   y V1 rn
#1:  9.3 2.4  1  1
#2:  5.1 7.1  1  2
#3:  0.6 4.2  2  3
#4:  8.5 3.2  1  5
#5:  1.3 8.1  2  6
#6: 10.8 5.9  1  8

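Or, keeping the original row names as a column 'rn' (this needs 'df' to still be a plain data.frame, because setDT does not add the row-name column to an object that is already a data.table):

setDT(df, keep.rownames=TRUE)[, list(V1=.N, rn=rn[1L]), .(x,y)]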
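If the result should look exactly like the output requested in the question (a data frame whose row names are the first-occurrence indices), the stored row names can be moved back into place. The following is a minimal sketch, not part of the original answer: it uses as.data.table() rather than setDT() so the original data frame is left unmodified, and the names `res` and `out` are illustrative.

```r
library(data.table)

df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))

# as.data.table() makes a copy; keep.rownames = TRUE stores the row names
# in a character column called "rn"
res <- as.data.table(df, keep.rownames = TRUE)[
  , .(V1 = .N, rn = rn[1L]), by = .(x, y)]

# Convert back to a data.frame and restore the first-occurrence row names
out <- as.data.frame(res[, .(x, y, V1)])
rownames(out) <- res$rn
out
```

Here `out` has the row names 1, 2, 3, 5, 6, 8 and the counts in V1, while `df` itself stays a plain data frame.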
akrun

If you want to keep the index:

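library(data.table)
# De-duplicate the grouped result itself: grouping can reorder rows relative to df
res <- setDT(df)[, .(.I, .N), by = names(df)]
res[!duplicated(res, by = names(df))]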
#      x   y I N
#1:  9.3 2.4 1 1
#2:  5.1 7.1 2 1
#3:  0.6 4.2 3 2
#4:  8.5 3.2 5 1
#5:  1.3 8.1 6 2
#6: 10.8 5.9 8 1

Or using data.table's unique method:

unique(setDT(df)[,.(.I, .N), by = names(df)], by = names(df))
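In the example data the duplicates happen to sit on adjacent rows. As a sanity check, here is a small sketch on a hypothetical data frame `df2` (not from the question) in which the duplicates are not adjacent. It compares taking the first index per group directly with the unique-by-group approach above. The names `dt2`, `a` and `b` are illustrative.

```r
library(data.table)

# Hypothetical data: row 3 repeats row 1 and row 5 repeats row 2
df2 <- data.frame(x = c(1, 2, 1, 3, 2), y = c(1, 2, 1, 3, 2))
dt2 <- as.data.table(df2)
cols <- names(df2)

# Approach A: first row index and count per group in one step
a <- dt2[, .(I = .I[1L], N = .N), by = cols]

# Approach B: all row indices per group, then keep the first row of each group
b <- unique(dt2[, .(I = .I, N = .N), by = cols], by = cols)

# Both give I = 1, 2, 4 and N = 2, 2, 1
a
b
```

Both approaches report the index of the first occurrence because data.table returns groups in order of first appearance and keeps the rows within each group in their original order.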
David Arenburg
Colonel Beauvel
  • Great! This is what I wanted. Thanks. I'll update on the efficiency issue I was facing with other methods comparing them with this. – Ira Nov 26 '15 at 10:24