R - Count duplicated rows keeping index of their first occurrences

Question

I have been looking for an efficient way of counting and removing duplicate rows in a data frame while keeping the index of their first occurrences. For example, with this data frame:

library(plyr)  # provides ddply
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))
ddply(df, names(df), nrow)

the ddply call gives me

     x   y V1
1  0.6 4.2  2
2  1.3 8.1  2
3  5.1 7.1  1
4  8.5 3.2  1
5  9.3 2.4  1
6 10.8 5.9  1

But I want to keep the original indices (along with the row names) of the first occurrences of the duplicated rows, like this:

     x   y V1
1  9.3 2.4  1
2  5.1 7.1  1
3  0.6 4.2  2
5  8.5 3.2  1
6  1.3 8.1  2
8 10.8 5.9  1

Subsetting with "duplicated" (df[!duplicated(df), ]) keeps the original row names (here {1 2 3 5 6 8}) but doesn't count the number of occurrences. I tried writing my own functions, but none of them is efficient enough to handle big data. My data frame can have up to a couple of million rows (though usually only 5 to 10 columns).

I'd guess you took your solution from the duplicated post. I wonder, though, why you haven't looked further down at the rest of the solutions. – David Arenburg, Nov 26 '15 at 10:12
I did look at many solutions but couldn't find "keeping the first occurrences of the duplicates" anywhere. Since I don't have the privilege to comment on older posts, and asking a question in the answer box would have been wrong, I had to create a new post. I asked this question after struggling with it for 6 days. – Ira, Nov 26 '15 at 10:29

akrun · Answer 1 · 2015-11-26T10:38:03.787

2

We can use data.table. Convert the 'data.frame' to a 'data.table' by reference (setDT(df)), group by the 'x' and 'y' columns, and get the number of rows in each group (.N).

library(data.table)
setDT(df)[, list(V1=.N), by = .(x,y)]
#      x   y V1
#1:  9.3 2.4  1
#2:  5.1 7.1  1
#3:  0.6 4.2  2
#4:  8.5 3.2  1
#5:  1.3 8.1  2
#6: 10.8 5.9  1

If we need the row index of each first occurrence, .I[1L] gives it:

setDT(df)[, list(V1= .N, rn=.I[1L]), by = .(x,y)]
#      x   y V1 rn
#1:  9.3 2.4  1  1
#2:  5.1 7.1  1  2
#3:  0.6 4.2  2  3
#4:  8.5 3.2  1  5
#5:  1.3 8.1  2  6
#6: 10.8 5.9  1  8
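The example data only has adjacent duplicates, so here is a small sketch checking that .I[1L] still picks the first occurrence when the duplicates are spread out. The data frame df2 is made up for illustration, and as.data.table is used instead of setDT so that df2 itself stays a plain data.frame.

```r
library(data.table)

# Hypothetical data: (0.6, 4.2) sits in rows 1 and 4, (1.3, 8.1) in rows 3 and 5
df2 <- data.frame(x = c(0.6, 5.1, 1.3, 0.6, 1.3),
                  y = c(4.2, 7.1, 8.1, 4.2, 8.1))

# Row positions of the first occurrences, computed with base R for comparison
first_rows <- which(!duplicated(df2))

# Same grouping as above, on a copy so df2 is not modified by reference
res <- as.data.table(df2)[, list(V1 = .N, rn = .I[1L]), by = .(x, y)]

# rn is 1, 2, 3 and V1 is 2, 1, 2
stopifnot(identical(res$rn, first_rows))
res
```

Groups come back in order of first appearance, so rn is increasing and lines up with what duplicated reports.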

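Or, to carry the original row names along instead of the row index. This needs 'df' to still be a plain 'data.frame', so that setDT can copy its row names into the 'rn' column:

setDT(df, keep.rownames=TRUE)[, list(V1=.N, rn=rn[1L]), .(x,y)]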
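If the row names are not just 1..n, the difference between a row index and a row name matters. A minimal sketch, assuming a made-up data frame df3 with character row names, and using as.data.table (which also accepts keep.rownames) so the input is not changed by reference:

```r
library(data.table)

# Hypothetical data with non-default row names
df3 <- data.frame(x = c(0.6, 0.6, 1.3),
                  y = c(4.2, 4.2, 8.1),
                  row.names = c("a", "b", "c"))

# keep.rownames = TRUE stores the row names in a character column called "rn"
res3 <- as.data.table(df3, keep.rownames = TRUE)[
  , list(V1 = .N, rn = rn[1L]), by = .(x, y)]

# Back to a data.frame that carries the first-occurrence row names
out <- as.data.frame(res3[, .(x, y, V1)])
rownames(out) <- res3$rn
out
```

Here out has row names "a" and "c", with counts 2 and 1.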

edited Nov 26 '15 at 10:38

answered Nov 26 '15 at 10:09

akrun

Thanks. But it doesn't give the row names back. I would like the row names to be {1 2 3 5 6 8}. – Ira Nov 26 '15 at 10:18
@Ira Updated the post – akrun Nov 26 '15 at 10:28
Thanks. And sorry, I have already accepted an answer. Thanks so much for the prompt solution. – Ira Nov 26 '15 at 10:31
@DavidArenburg You are right. I was working in a different direction before I changed gears. – akrun Nov 26 '15 at 10:37

score 2 · Accepted Answer · edited Nov 26 '15 at 10:34

If you want to keep the index:

library(data.table)
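# .I[1L] is the row index of each group's first occurrence, .N the group size.
# (Filtering the grouped result with !duplicated(df) is only safe when duplicates are adjacent.)
setDT(df)[, .(I = .I[1L], N = .N), by = names(df)]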
#      x   y I N
#1:  9.3 2.4 1 1
#2:  5.1 7.1 2 1
#3:  0.6 4.2 3 2
#4:  8.5 3.2 5 1
#5:  1.3 8.1 6 2
#6: 10.8 5.9 8 1

Or, using data.table's unique method, which keeps the first row of each group:

unique(setDT(df)[,.(.I, .N), by = names(df)], by = names(df))
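For comparison, here is a base R sketch that keeps the original row names without data.table. It is only one possible approach, not something from the answer above. It assumes that pasting the columns into a single string key is an acceptable way to identify a row, so values that differ only beyond the digits kept by the character conversion would be treated as equal. The names key, first and counts are made up for this example.

```r
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))

# One string per row; "\r" is an arbitrary separator unlikely to occur in the data
key <- do.call(paste, c(df, sep = "\r"))

# TRUE for the first occurrence of each distinct row
first <- !duplicated(key)

# Subsetting a data.frame keeps the original row names
counts <- df[first, , drop = FALSE]

# match() maps every row to its group (in first-appearance order), tabulate() counts them
counts$V1 <- tabulate(match(key, key[first]))
counts
```

The row names of counts are 1, 2, 3, 5, 6 and 8, and V1 is 1, 1, 2, 1, 2, 1, matching the output requested in the question.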

edited Nov 26 '15 at 10:34

David Arenburg

answered Nov 26 '15 at 10:14

Colonel Beauvel

Great! This is what I wanted. Thanks. I'll update on the efficiency issue I was facing with other methods comparing them with this. – Ira Nov 26 '15 at 10:24
