I have been looking for an efficient way of counting and removing duplicate rows in a data frame while keeping the index of their first occurrences. For example, if I have a data frame:
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))
library(plyr)  # ddply() comes from the plyr package
ddply(df, names(df), nrow)
gives me
     x   y V1
1  0.6 4.2  2
2  1.3 8.1  2
3  5.1 7.1  1
4  8.5 3.2  1
5  9.3 2.4  1
6 10.8 5.9  1
But I want to keep the original indices (along with the row names) of the first occurrences of the duplicated rows, like this:
     x   y V1
1  9.3 2.4  1
2  5.1 7.1  1
3  0.6 4.2  2
5  8.5 3.2  1
6  1.3 8.1  2
8 10.8 5.9  1
"duplicated" lets me recover the original row names (here {1 2 3 5 6 8}, via df[!duplicated(df), ]), but it doesn't count the number of occurrences. I tried writing functions of my own, but none of them is efficient enough to handle big data. My data frame can have up to a couple of million rows (though usually only 5 to 10 columns).
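For reference, the target output above can be reproduced with a short base-R sketch along the following lines. The function name count_first_occurrences is hypothetical, and the approach (pasting each row into a single string key, then combining duplicated, match and tabulate) is only one possible way to do it. It assumes that pasting the columns is an acceptable way to compare rows, and it has not been benchmarked on data of the size described above.

```r
# Hypothetical helper: keep the first occurrence of each distinct row,
# preserve its original row name, and add a count column V1.
count_first_occurrences <- function(df) {
  # One string key per row; "\r" is an assumed separator that should not
  # appear in the data. Numeric values are compared via their character form.
  key <- do.call(paste, c(df, sep = "\r"))

  # TRUE for the first occurrence of each distinct key.
  first <- !duplicated(key)

  # Map every row to the position of its key among the first occurrences,
  # then count how many rows fall on each position.
  group <- match(key, key[first])
  counts <- tabulate(group, nbins = sum(first))

  # Subsetting keeps the original row names of the first occurrences.
  result <- df[first, , drop = FALSE]
  result$V1 <- counts
  result
}

df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))
result <- count_first_occurrences(df)
print(result)
```

Because the counts come out in first-occurrence order, they line up with the rows kept by duplicated without any sorting or merging step.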