0

I want to group the data by ID and slice the first row, based on condition.

I have the following dataset:

head(data) 

    ID Cond1 
     A   10
     A   10
     B   20
     B   30

Now, I want to slice the rows based on condition:

  • If value in Cond1 is unique for both rows, keep them;
  • If value in Cond1 is duplicate, keep top row.

Any ideas?

Prometheus
  • 1,977
  • 3
  • 30
  • 57
  • @Jaap The question is different (if you have read it) – akrun May 23 '17 at 15:16
  • @Jaap actually, its not a duplicate. Thanks for the negative reputation though. – Prometheus May 23 '17 at 15:26
  • @Prometheus I only closed the question; I didn't downvote. As the question now stands, it is a duplicate. If you think it is not, please clarify and update your question. – Jaap May 23 '17 at 15:28
  • @Jeep Well, the question you pointed asks for removing duplicate rows in general. I needed to group the data, and remove duplicates in a specific column. This is a subtle, but key difference because my original data contains duplicates in other rows that shouldn't be removed. – Prometheus May 23 '17 at 15:36
  • 1
    @Prometheus I look at the accepted answer of the linked question, it is exactly doing what you need: It only removes duplicates for specific columns. – Jaap May 23 '17 at 15:54

2 Answers2

3

You can use the base R function ave like this:

datafr[!(ave(datafr$Cond1, datafr$ID, FUN=duplicated)),]
  ID Cond1
1  A    10
3  B    20
4  B    30

ave returns a numeric vector by ID with a 1 if the element of Cond1 is duplicated and a 0 if it is not. the ! performs two roles, first it converts the resulting vector to a logical vector appropriate for subetting. Second it reverses the results, keeping the non-duplicate elements.

In data.table, you could use a join.

setDT(datafr)[datafr[, !duplicated(Cond1), by=ID]$V1]
   ID Cond1
1:  A    10
2:  B    20
3:  B    30

The inner data.frame returns a logical for not duplicated elements by ID and is pulled out into a vector via $V1. This logical vector is fed to the original data.table to perform the subsetting.

data

datafr <-
structure(list(ID = c("A", "A", "B", "B"), Cond1 = c(10L, 10L, 
20L, 30L)), .Names = c("ID", "Cond1"), row.names = c(NA, -4L), class = "data.frame")
lmo
  • 37,904
  • 9
  • 56
  • 69
2

We can use n_distinct to filter

library(dplyr)
data %>%
     group_by(ID) %>% 
     filter(n_distinct(Cond1)==n()| row_number()==1)

Or just

data[!duplicated(data),]
#   ID Cond1
#1  A    10
#3  B    20
#4  B    30

Based on the description in the OP's post, if we include another row with B 20, the first solution should give

data %>%
  group_by(ID) %>% 
  filter(n_distinct(Cond1)==n()| row_number()==1)
# A tibble: 2 x 2
# Groups: ID [2]
#     ID Cond1
#  <chr> <int>
#1     A    10
#2     B    20
akrun
  • 874,273
  • 37
  • 540
  • 662