Filtering a dataframe showing only duplicates

Question

I need some help to filter a dataframe.

The df has several columns and I want to split it into two dataframes:

1- One including only the rows in which the first column is a duplicate (including all of the replicas).

2- The rest of the rows, which are not duplicates.

Here is an example: This would be the original.

          V1  V2 
    [1,] "A" "1"
    [2,] "B" "1"
    [3,] "A" "1"
    [4,] "C" "2"
    [5,] "D" "3"
    [6,] "D" "4"

I want to turn into this:

         V1  V2 
   [1,] "A" "1"
   [2,] "A" "1"
   [3,] "D" "3"
   [4,] "D" "4"

And this:

        V1  V2 
  [1,] "B" "1"
  [2,] "C" "2"

Is there a way to do that? I have tried exporting to Excel, but the dataset was too large to make that viable.

Thank you

Can you provide a reproducible data set, and the desired result? http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Pierre Lapointe, Apr 20 '17 at 04:24

Aramis7d · Accepted Answer · 2017-04-20T05:03:49.050

9

Considering df as your input, you can use dplyr and try:

df %>% group_by(V1) %>% filter(n() > 1)

for the duplicates

and

df %>% group_by(V1) %>% filter(n() == 1)

for the unique entries.

edited Apr 20 '17 at 05:03

answered Apr 20 '17 at 04:55

Aramis7d

2,444
19
25

score 2 · Answer 2 · answered Apr 20 '17 at 05:01

You can use duplicated but bear in mind that duplicated only returns TRUE at the first duplicated value, i.e.

> duplicated(c("A", "A", "A"))
[1] FALSE  TRUE  TRUE

does not return TRUE for the first "A". If you want to catch all values of "A" you can e.g. use

duplicated(c("A", "A", "A")) | duplicated(c("A", "A", "A"), fromLast = TRUE)
# [1] TRUE TRUE TRUE

You can then separate your data using

## Index of the duplicated values:
indDuplicatedVec <- duplicated(d[,1]) | duplicated(d[,1], fromLast = TRUE)

myDuplicates <- d[indDuplicatedVec, ]
myUniques <- d[!indDuplicatedVec, ]

> myDuplicates
#V1 V2
#1  A  1
#3  A  1
#5  D  3
#6  D  4

> myUniques
#V1 V2
#2  B  1
#4  C  2

score 1 · Answer 3 · answered Apr 20 '17 at 05:37

1

We can use data.table

library(data.table)
setDT(df)[, .SD[.N >1], V1]

answered Apr 20 '17 at 05:37

akrun

874,273
37
540
662

Marat Talipov · Answer 4 · 2017-04-20T13:22:17.173

-1

Try

d[!duplicated(d),]

and

d[duplicated(d),]

where d is your database.

=== UPDATE === If only the first column is desired, and all duplicates need to go in a separate column, you could do:

library(gdata) d[duplicated2(d$V1,bothWays = T),] d[!duplicated2(d$V1,bothWays = T),]

If only base R is desired, then:

bm <- duplicated(d$V1) | duplicated(d$V1,fromLast = TRUE) d[bm,] d[!bm,]

edited Apr 20 '17 at 13:22

answered Apr 20 '17 at 04:31

Marat Talipov

13,064
5
34
53

It is not working. Using that code, the new datasets will not have both of the values of D, for instance. I want a code that makes possible getting the same original dataset if i rbind the 2 split dataframes again. – Arthur Carvalho Brito Apr 20 '17 at 04:56
The problem here is that you're considering all the columns. Also, OP does not want to retain any instances of duplicated columns in the unique set. – Aramis7d Apr 20 '17 at 04:57
Oh, I see. There was no example in the original version of the question. I will edit the answer – Marat Talipov Apr 20 '17 at 13:17

Filtering a dataframe showing only duplicates

4 Answers4

Linked

Related