3

Say I have a dataframe, df, with three columns:

  colours   individual value
1   white individual 1   0.4
2   white individual 1   0.7
3   black individual 2   1.1
4   black individual 3   0.5

Sometimes the same person shows up multiple times for the same colour but with different values. I would like to write some code that deletes every row where this happens.

**EDIT:** There are many more rows than four (millions), so I don't think the current solutions work.

I would like to count how many times the string I am currently on in my for loop comes up, and then delete those rows from the data.frame. So in the example above, I would like to get rid of individual 1, leaving the other two rows in the df.

So far my approach was this:

  1. Get a list of all the colours

  2. Get a list of all the individuals

  3. Write two for loops.

    colours <- unique(df$colours)
    ind <- unique(df$individual)
    for (i in ind) {
      for (c in colours) {
        # something here. Probably an if, asking whether the person I'm on in
        # the loop is found with the colour I'm on more than once; if so, get
        # rid of them
      }
    }
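For reference, the loop idea above can be completed as a sketch (it works, but scans the whole data.frame once per individual/colour pair, so it will be slow on millions of rows; the vectorised answers below avoid the loop entirely):

```r
# Sample data from the question
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

drop <- rep(FALSE, nrow(df))
for (i in unique(df$individual)) {
  for (c in unique(df$colours)) {
    hits <- df$individual == i & df$colours == c
    if (sum(hits) > 1) drop[hits] <- TRUE  # pair occurs more than once: drop all of its rows
  }
}
df[!drop, ]  # only the two "black" rows remain
```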

My expected output is this:

colours  individual   value
black   individual 2   1.1
black   individual 3   0.5

Source data

df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))
Gotmadstacks
  • Can you update with the expected output? – akrun Apr 01 '16 at 12:20
  • Thanks for the suggestion, mtoto. That would work except there is another vector with values that are different. I can't tell which value is the correct one to take, so I am trying to just delete where I find two values for the same individual for the same colour. I will edit this into the question as I have just noticed how that complicates things. – Gotmadstacks Apr 01 '16 at 12:24
  • Can someone unmark this as a duplicate? It's not. The rows are different. I changed the question to reflect this. Sorry for prior confusion. – Gotmadstacks Apr 01 '16 at 12:27
  • Ok, I was only following mtoto's link. I will unmark it if that is not the case. But, I would say the `duplicated(...)|duplicated(..., fromLast=TRUE)` is a dupe. – akrun Apr 01 '16 at 12:28
  • It is still a duplicate as far as I see it, just need to specify the cols as in `df[!(duplicated(df[1:2]) | duplicated(df[1:2], fromLast = TRUE)), ]` – mtoto Apr 01 '16 at 12:31
  • Alright. I'll try it out and report back. Thanks. – Gotmadstacks Apr 01 '16 at 12:32
  • Hard to believe this has not been answered elsewhere on SO, but I don't see any true duplicate questions, where the focus is just on a few columns. – Sam Firke Apr 01 '16 at 13:05
  • `df[1==ave(1:nrow(df), df[,1:2], FUN=length),]` – A. Webb Apr 01 '16 at 14:23

5 Answers

4

You could try with anti_join() from the dplyr library:

library(dplyr)
anti_join(df, df[duplicated(df[1:2]),], by="individual")
#  colours   individual value
#1   black individual 3   0.5
#2   black individual 2   1.1
RHertel
  • Hi, Does this not only get rid of the first two rows? I'm guessing that to get it to go over all the rows I need to modify so I get rid of it applying to only the first two columns? – Gotmadstacks Apr 04 '16 at 11:32
  • @Gotmadstacks It should eliminate any row in the data.frame which has a combination of entries in the columns one and two occurring more than once. – RHertel Apr 04 '16 at 16:55
2

A straightforward dplyr approach would be to group as desired and filter for groups with fewer than 2 observations:

library(dplyr)
df %>%
  group_by(colours, individual) %>%
  filter(n() < 2)

Source: local data frame [2 x 3]
Groups: colours, individual [2]

  colours   individual value
   (fctr)       (fctr) (dbl)
1   black individual 2   1.1
2   black individual 3   0.5
Sam Firke
1

Here is another option using data.table

library(data.table)
setDT(df)[, if (.N == 1) .SD, .(colours, individual)]
#   colours   individual value
#1:   black individual 2   1.1
#2:   black individual 3   0.5
akrun
1

On the basis of some suggestions in the comments, this answer worked best:

df[!(duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast = TRUE)), ]

Slightly different from the comments: this explicitly subsets the first two columns (`df[,1:2]`), so a row is removed only when both `colours` and `individual` are duplicated, which is exactly the result asked for in the question. It also scales to the real data, which has millions of rows rather than the four in the example.
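To see why both calls are needed: `duplicated()` alone flags only the second and later copies of a colour/individual pair, while `fromLast = TRUE` flags all but the last, so OR-ing them marks every row of a repeated pair. A quick check on the question's data:

```r
df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

duplicated(df[, 1:2])                    # FALSE  TRUE FALSE FALSE (second copy only)
duplicated(df[, 1:2], fromLast = TRUE)   #  TRUE FALSE FALSE FALSE (first copy only)
df[!(duplicated(df[, 1:2]) | duplicated(df[, 1:2], fromLast = TRUE)), ]
#   colours   individual value
# 3   black individual 2   1.1
# 4   black individual 3   0.5
```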

Gotmadstacks
0

This should do it. I created a sample dataset and added an index vector to show that you keep only the first occurrence of each colour-user combination. This works if your rownames are actual row numbers.

## Data preparation
set.seed(42)  # make the sample data reproducible
colours <- sample(c("red","blue","green","yellow"), size = 50, replace = T)
users <- sample(1:10, size = 50, replace = T)
df <- data.frame(colours, users)
df$value <- runif(50)
df$index <- 1:50

## Keep only the first occurence
res <- unique(df[,1:2])
res$values <- df$value[as.integer(rownames(res))]
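For comparison, the same "keep the first occurrence of each colour-user pair" can be written in one line with `duplicated()` on the two grouping columns (a sketch against the sample `df` built above; the value and index columns come along automatically):

```r
res2 <- df[!duplicated(df[, c("colours", "users")]), ]
```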
Ujjwal Kumar