6

I need some help to filter a dataframe.

The df has several columns and I want to split it into two dataframes:

1- One including only the rows in which the first column is a duplicate (including all of the replicas).

2- The rest of the rows, which are not duplicates.

Here is an example: This would be the original.

          V1  V2 
    [1,] "A" "1"
    [2,] "B" "1"
    [3,] "A" "1"
    [4,] "C" "2"
    [5,] "D" "3"
    [6,] "D" "4"

I want to turn into this:

         V1  V2 
   [1,] "A" "1"
   [2,] "A" "1"
   [3,] "D" "3"
   [4,] "D" "4"

And this:

        V1  V2 
  [1,] "B" "1"
  [2,] "C" "2"

Is there a way to do that? I have tried exporting to Excel, but the dataset was too large to make that viable.

Thank you

4 Answers4

9

Considering df as your input, you can use dplyr and try:

df %>% group_by(V1) %>% filter(n() > 1)

for the duplicates

and

df %>% group_by(V1) %>% filter(n() == 1)

for the unique entries.

Aramis7d
  • 2,444
  • 19
  • 25
2

You can use duplicated but bear in mind that duplicated only returns TRUE at the first duplicated value, i.e.

> duplicated(c("A", "A", "A"))
[1] FALSE  TRUE  TRUE 

does not return TRUE for the first "A". If you want to catch all values of "A" you can e.g. use

duplicated(c("A", "A", "A")) | duplicated(c("A", "A", "A"), fromLast = TRUE)
# [1] TRUE TRUE TRUE

You can then separate your data using

## Index of the duplicated values:
indDuplicatedVec <- duplicated(d[,1]) | duplicated(d[,1], fromLast = TRUE)

myDuplicates <- d[indDuplicatedVec, ]
myUniques <- d[!indDuplicatedVec, ]

> myDuplicates
#V1 V2
#1  A  1
#3  A  1
#5  D  3
#6  D  4

> myUniques
#V1 V2
#2  B  1
#4  C  2
ikop
  • 1,760
  • 1
  • 12
  • 24
1

We can use data.table

library(data.table)
setDT(df)[, .SD[.N >1], V1]
akrun
  • 874,273
  • 37
  • 540
  • 662
-1

Try

d[!duplicated(d),]

and

d[duplicated(d),]

where d is your database.

=== UPDATE === If only the first column is desired, and all duplicates need to go in a separate column, you could do:

library(gdata) d[duplicated2(d$V1,bothWays = T),] d[!duplicated2(d$V1,bothWays = T),]

If only base R is desired, then:

bm <- duplicated(d$V1) | duplicated(d$V1,fromLast = TRUE) d[bm,] d[!bm,]

Marat Talipov
  • 13,064
  • 5
  • 34
  • 53
  • It is not working. Using that code, the new datasets will not have both of the values of D, for instance. I want a code that makes possible getting the same original dataset if i rbind the 2 split dataframes again. – Arthur Carvalho Brito Apr 20 '17 at 04:56
  • The problem here is that you're considering all the columns. Also, OP does not want to retain any instances of duplicated columns in the unique set. – Aramis7d Apr 20 '17 at 04:57
  • Oh, I see. There was no example in the original version of the question. I will edit the answer – Marat Talipov Apr 20 '17 at 13:17