2

This seems so simple, but I can't figure it out.

Given this data frame

df=data.frame(
  x = c(12,12,165,165,115,148,148,155,155,521),
  y = c(54,54,122,122,215,108,108,655,655,151)  
)


 df
     x   y
1   12  54
2   12  54
3  165 122
4  165 122
5  115 215
6  148 108
7  148 108
8  155 655
9  155 655
10 521 151

Now, how can I get the rows that only exists once. That is row 5 and 10. The order of rows can be totally arbitrary, so checking for the "next" row is not an option. I tried many things but nothing worked on my data.frame which has ~40k rows.

I had one solution working on a subset (~1k rows) of my data.frame which took 3 minutes to process. Thus, my solution would require 120 minutes on my original data.frame which is not appropiate. Can somebody help?

Alex P
  • 1,574
  • 13
  • 28
agoldev
  • 2,078
  • 3
  • 23
  • 38
  • If you are just looking for a quick visual method, you could use `table(df$x, df$y)`. This will not return row number, but instead will return a matrix with `x` values in the rows and `y` values in the columns. – James Theobald Jul 28 '17 at 20:47
  • This question has been asked several times before, a quick Google search returned: https://stackoverflow.com/questions/38142890/find-unique-rows-in-a-data-frame-in-r, https://stackoverflow.com/questions/27408439/extracting-only-unique-rows-from-data-frame-in-r for instance. – Uwe Jul 29 '17 at 06:37

4 Answers4

6

Check duplicated from the beginning and end of the data frame, if none returns true, then select it:

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)),]

#     x   y
#5  115 215
#10 521 151
Psidom
  • 209,562
  • 33
  • 339
  • 356
1

A solution with table

library(dplyr)
table(df) %>% as.data.frame %>% subset(Freq ==1) %>% select(-3)

or with base as you said in comments you prefer not to load packages:

subset(as.data.frame(table(df)),Freq ==1)[,-3]

Also I think data.table is very fast for big data sets and filtering, so this may be worth trying too as you mentionned speed:

df2 <- copy(df)
df2 <- setDT(df2)[, COUNT := .N, by='x,y'][COUNT ==1][,c("x","y")]
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • `filter`, not `subset`, right? – Frank Jul 28 '17 at 20:56
  • I never quite got the difference, apart from the fact `subset` keeps the initial row numbers. `filter` can also be masked by other packages, so I usually use `subset`. Is it bad ? – moodymudskipper Jul 28 '17 at 20:59
  • Nope, not at all. I just thought you meant to do it dplyr-style. – Frank Jul 28 '17 at 21:01
  • Why is `subset` not dplyr style :) ? You mean that it's base function ? – moodymudskipper Jul 28 '17 at 21:05
  • Yeah, exactly :) Hadley excludes `subset` from his "grammar of data manipulation", then re-included it after rewriting it himself as `filter`. – Frank Jul 28 '17 at 21:06
  • 1
    this is relevant, in short filter discards row names is faster for bigger datasets (slower for smaller), and better to work with sql databases: https://stackoverflow.com/questions/39882463/difference-between-subset-and-filter-from-dplyr/43144967#43144967 – moodymudskipper Jul 28 '17 at 21:09
  • Well, dplyr::filter is designed for tibbledy-dos, and I think those do not support row names, so I guess that makes sense. As far as speed goes, neither is designed to be used programmatically (instead `[` or `filter_`, I guess), so it might not be super important. – Frank Jul 28 '17 at 21:18
0

A solution using dplyr. df2 is the final output.

library(dplyr)
df2 <- df %>%
  count(x, y) %>%
  filter(n == 1) %>%
  select(-n)
www
  • 38,575
  • 12
  • 48
  • 84
  • Thanks for your answer. Not requiring another lib was a benefit, since this is used in a package we develop. – agoldev Jul 28 '17 at 20:53
0

Another base R solution that uses ave to calculate the total number of occurrences for each row and subsets only those that occur 1 time. It could also be modified for subsetting rows that occur a specific number of times.

df[ave(1:NROW(df), df, FUN = length) == 1,]
#     x   y
#5  115 215
#10 521 151
d.b
  • 32,245
  • 6
  • 36
  • 77