3

Say I have a dataframe, df, with three columns:

  colours   individual value
1   white individual 1   0.4
2   white individual 1   0.7
3   black individual 2   1.1
4   black individual 3   0.5

Sometimes the same person shows up multiple times for the same colour but with different values. I would like to write some code that deletes every row where this happens.

**EDIT:** There are many more rows than four (millions), so I don't think the current solutions work.

I would like to count how many times the string I am currently on in my for loop comes up, and then delete those rows from the data.frame. So in the example above, I would like to get rid of individual 1, leaving the other two rows in the df.

So far my approach was this:

  1. Get a list of all the colours

  2. Get a list of all the individuals

  3. Write two for loops.

    colours <- unique(df$colours)
    ind <- unique(df$individual)
    for (i in ind) {
      for (c in colours) {
        # something here. Probably an if, asking whether the person I'm on in
        # the loop is found with the colour I'm on more than once; if so, get
        # rid of them
      }
    }
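For reference, the loop idea above can be completed as a sketch (it works, but scans the whole data.frame once per individual/colour pair, so it will be slow on millions of rows; the vectorised answers below avoid the loop entirely):

```r
# Sample data from the question
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

drop <- rep(FALSE, nrow(df))
for (i in unique(df$individual)) {
  for (c in unique(df$colours)) {
    hits <- df$individual == i & df$colours == c
    if (sum(hits) > 1) drop[hits] <- TRUE  # pair occurs more than once: drop all of its rows
  }
}
df[!drop, ]  # only the two "black" rows remain
```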

My expected output is this:

colours  individual   value
black   individual 2   1.1
black   individual 3   0.5

Source data

df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))
Gotmadstacks
  • Can you update with the expected output? – akrun Apr 01 '16 at 12:20
  • Thanks for the suggestion, mtoto. That would work except there is another vector with values that are different. I can't tell which value is the correct one to take, so I am trying to just delete where I find two values for the same individual for the same colour. I will edit this into the question as I have just noticed how that complicates things. – Gotmadstacks Apr 01 '16 at 12:24
  • Can someone unmark this as a duplicate? It's not. The rows are different. I changed the question to reflect this. Sorry for prior confusion. – Gotmadstacks Apr 01 '16 at 12:27
  • Ok, I was only following mtoto's link. I will unmark it if that is not the case. But, I would say the `duplicated(...)|duplicated(..., fromLast=TRUE)` is a dupe. – akrun Apr 01 '16 at 12:28
  • It is still a duplicate as far as I see it, just need to specify the cols as in `df[!(duplicated(df[1:2]) | duplicated(df[1:2], fromLast = TRUE)), ]` – mtoto Apr 01 '16 at 12:31
  • Alright. I'll try it out and report back. Thanks. – Gotmadstacks Apr 01 '16 at 12:32
  • Hard to believe this has not been answered elsewhere on SO, but I don't see any true duplicate questions, where the focus is just on a few columns. – Sam Firke Apr 01 '16 at 13:05
  • `df[1==ave(1:nrow(df), df[,1:2], FUN=length),]` – A. Webb Apr 01 '16 at 14:23

5 Answers

4

You could try with anti_join() from the dplyr library:

library(dplyr)
anti_join(df, df[duplicated(df[1:2]),], by="individual")
#  colours   individual value
#1   black individual 3   0.5
#2   black individual 2   1.1
RHertel
  • Hi, Does this not only get rid of the first two rows? I'm guessing that to get it to go over all the rows I need to modify so I get rid of it applying to only the first two columns? – Gotmadstacks Apr 04 '16 at 11:32
  • @Gotmadstacks It should eliminate any row in the data.frame which has a combination of entries in the columns one and two occurring more than once. – RHertel Apr 04 '16 at 16:55
2

A straightforward dplyr approach would be to group as desired and filter for groups with fewer than 2 observations:

library(dplyr)
df %>%
  group_by(colours, individual) %>%
  filter(n() < 2)

Source: local data frame [2 x 3]
Groups: colours, individual [2]

  colours   individual value
   (fctr)       (fctr) (dbl)
1   black individual 2   1.1
2   black individual 3   0.5
Sam Firke
1

Here is another option using data.table

library(data.table)
setDT(df)[, if (.N == 1) .SD, .(colours, individual)]
#   colours   individual value
#1:   black individual 2   1.1
#2:   black individual 3   0.5
akrun
1

On the basis of some suggestions in the comments, this answer worked best:

df[!(duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast = TRUE)), ]

Slightly different from the comments: this explicitly subsets the first two columns (`df[,1:2]`), so a row is removed only when both `colours` and `individual` are duplicated, which is exactly the result asked for in the question. It also scales to the real data, which has millions of rows rather than the four in the example.
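To see why both calls are needed: `duplicated()` alone flags only the second and later copies of a colour/individual pair, while `fromLast = TRUE` flags all but the last, so OR-ing them marks every row of a repeated pair. A quick check on the question's data:

```r
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

duplicated(df[, 1:2])                    # FALSE  TRUE FALSE FALSE (second copy only)
duplicated(df[, 1:2], fromLast = TRUE)   #  TRUE FALSE FALSE FALSE (first copy only)
df[!(duplicated(df[, 1:2]) | duplicated(df[, 1:2], fromLast = TRUE)), ]
#   colours   individual value
# 3   black individual 2   1.1
# 4   black individual 3   0.5
```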

Gotmadstacks
0

This should do it. I created a sample dataset and added an index vector to show that you keep only the first occurrence of each colour-user combination. This works if your rownames are actual row numbers.

## Data preparation
set.seed(42)  # make the sample data reproducible
colours <- sample(c("red","blue","green","yellow"), size = 50, replace = T)
users <- sample(1:10, size = 50, replace = T)
df <- data.frame(colours, users)
df$value <- runif(50)
df$index <- 1:50

## Keep only the first occurence
res <- unique(df[,1:2])
res$values <- df$value[as.integer(rownames(res))]
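For comparison, the same "keep the first occurrence of each colour-user pair" can be written in one line with `duplicated()` on the two grouping columns (a sketch against the sample `df` built above; the value and index columns come along automatically):

```r
res2 <- df[!duplicated(df[, c("colours", "users")]), ]
```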
Ujjwal Kumar