Find unique rows

Question

This seems so simple, but I can't figure it out.

Given this data frame

df=data.frame(
  x = c(12,12,165,165,115,148,148,155,155,521),
  y = c(54,54,122,122,215,108,108,655,655,151)  
)


 df
     x   y
1   12  54
2   12  54
3  165 122
4  165 122
5  115 215
6  148 108
7  148 108
8  155 655
9  155 655
10 521 151

Now, how can I get the rows that only exists once. That is row 5 and 10. The order of rows can be totally arbitrary, so checking for the "next" row is not an option. I tried many things but nothing worked on my data.frame which has ~40k rows.

I had one solution working on a subset (~1k rows) of my data.frame which took 3 minutes to process. Thus, my solution would require 120 minutes on my original data.frame which is not appropiate. Can somebody help?

If you are just looking for a quick visual method, you could use `table(df$x, df$y)`. This will not return row number, but instead will return a matrix with `x` values in the rows and `y` values in the columns. — James Theobald, Jul 28 '17 at 20:47
This question has been asked several times before, a quick Google search returned: https://stackoverflow.com/questions/38142890/find-unique-rows-in-a-data-frame-in-r, https://stackoverflow.com/questions/27408439/extracting-only-unique-rows-from-data-frame-in-r for instance. — Uwe, Jul 29 '17 at 06:37

score 6 · Accepted Answer · answered Jul 28 '17 at 20:43

6

Check duplicated from the beginning and end of the data frame, if none returns true, then select it:

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)),]

#     x   y
#5  115 215
#10 521 151

answered Jul 28 '17 at 20:43

Psidom

209,562
33
339
356

moodymudskipper · Answer 2 · 2017-07-28T21:25:17.537

1

A solution with table

library(dplyr)
table(df) %>% as.data.frame %>% subset(Freq ==1) %>% select(-3)

or with base as you said in comments you prefer not to load packages:

subset(as.data.frame(table(df)),Freq ==1)[,-3]

Also I think data.table is very fast for big data sets and filtering, so this may be worth trying too as you mentionned speed:

df2 <- copy(df)
df2 <- setDT(df2)[, COUNT := .N, by='x,y'][COUNT ==1][,c("x","y")]

edited Jul 28 '17 at 21:25

answered Jul 28 '17 at 20:43

moodymudskipper

46,417
11
121
167

`filter`, not `subset`, right? – Frank Jul 28 '17 at 20:56
I never quite got the difference, apart from the fact `subset` keeps the initial row numbers. `filter` can also be masked by other packages, so I usually use `subset`. Is it bad ? – moodymudskipper Jul 28 '17 at 20:59
Nope, not at all. I just thought you meant to do it dplyr-style. – Frank Jul 28 '17 at 21:01
Why is `subset` not dplyr style :) ? You mean that it's base function ? – moodymudskipper Jul 28 '17 at 21:05
Yeah, exactly :) Hadley excludes `subset` from his "grammar of data manipulation", then re-included it after rewriting it himself as `filter`. – Frank Jul 28 '17 at 21:06
1

this is relevant, in short filter discards row names is faster for bigger datasets (slower for smaller), and better to work with sql databases: https://stackoverflow.com/questions/39882463/difference-between-subset-and-filter-from-dplyr/43144967#43144967 – moodymudskipper Jul 28 '17 at 21:09
Well, dplyr::filter is designed for tibbledy-dos, and I think those do not support row names, so I guess that makes sense. As far as speed goes, neither is designed to be used programmatically (instead `[` or `filter_`, I guess), so it might not be super important. – Frank Jul 28 '17 at 21:18

score 0 · Answer 3 · answered Jul 28 '17 at 20:43

0

A solution using dplyr. df2 is the final output.

library(dplyr)
df2 <- df %>%
  count(x, y) %>%
  filter(n == 1) %>%
  select(-n)

answered Jul 28 '17 at 20:43

www

38,575
12
48
84

Thanks for your answer. Not requiring another lib was a benefit, since this is used in a package we develop. – agoldev Jul 28 '17 at 20:53

score 0 · Answer 4 · answered Jul 28 '17 at 20:52

Another base R solution that uses ave to calculate the total number of occurrences for each row and subsets only those that occur 1 time. It could also be modified for subsetting rows that occur a specific number of times.

df[ave(1:NROW(df), df, FUN = length) == 1,]
#     x   y
#5  115 215
#10 521 151

Find unique rows

4 Answers4