Remove duplicate entries list with condition in R

Question

I have a list in R called data. Data has columns CustID and EndDate.

What I want to do is to search through the list comparing CustID to find duplicate entries with the sameCustID.

On the found entries I want to compare the EndDate and remove the entry with the lowest value (oldest EndDate) from the list.

I have no idea on how to approach this problem since I am not very used to working with these functions in R.

Please provide a small reproducible example of your list, e.g. using `dput()` — Roman, Sep 06 '16 at 12:28
Is it a `list` or a `data.frame`? Please check [how to create a minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and update your question — talat, Sep 06 '16 at 12:30
To clarify the reason @docendodiscimus is asking this is because lists don't have columns. So it's unclear if you meant list elements or if you meant columns of a data.frame. — Hack-R, Sep 06 '16 at 12:32
Thank you for the replies and sorry for not providing enough information (was unsure of what was needed). The answer below solved my problem for me — P.Berg, Sep 06 '16 at 12:49

score 1 · Accepted Answer · edited Sep 06 '16 at 13:23

CustID  <- c(seq(1,10,1),seq(1,5,1))
EndDate <- c(Sys.Date(),rep(seq(Sys.Date(),Sys.Date()+6, 1),2))

# Let's assume you're starting with a list
data <- list(CustID, EndDate)

The list looks like this:

[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5

[[2]]
 [1] "2016-09-06" "2016-09-06" "2016-09-07" "2016-09-08" "2016-09-09" "2016-09-10" "2016-09-11" "2016-09-12" "2016-09-06"
[10] "2016-09-07" "2016-09-08" "2016-09-09" "2016-09-10" "2016-09-11" "2016-09-12"

# To make matching CustID and EndDate easy let's change it to a DF
df1           <- as.data.frame(data)
colnames(df1) <- c("CustID", "EndDate")

The data.frame looks like this:

   CustID    EndDate
1       1 2016-09-06
2       2 2016-09-06
3       3 2016-09-07
4       4 2016-09-08
5       5 2016-09-09
6       6 2016-09-10
7       7 2016-09-11
8       8 2016-09-12
9       9 2016-09-06
10     10 2016-09-07
11      1 2016-09-08
12      2 2016-09-09
13      3 2016-09-10
14      4 2016-09-11
15      5 2016-09-12

# Find duplicated CustID
dupID <- duplicated(df1$CustID)
dupdf <- df1[df1$CustID %in% df1$CustID[dupID],]

# Remove the entry with the oldest EndDate for each ID
res <- data.frame(CustID=NA, EndDate = as.Date(NA))

for(i in unique(dupdf$CustID)){
  tmp <- dupdf[dupdf$CustID == i, ]
  res <- rbind(res,tmp[!tmp$EndDate == min(tmp$EndDate),])
}
res <- res[!is.na(res$EndDate),]

The result (res) has the repeated customer ID's (custID) with the oldest EndDate removed for each ID:

   CustID    EndDate
11      1 2016-09-08
12      2 2016-09-09
13      3 2016-09-10
14      4 2016-09-11
15      5 2016-09-12

If you want a vectorized solution you might use data.table:

require(data.table)
dupdf <- data.table(dupdf)
dupdf[,.(
  EndDate = max(EndDate)
), by = CustID]

A suggestion from the comments is

data <- as.data.frame(data)
subset(data, as.logical(ave(as.numeric(EndDate), CustID, FUN = function(x) { 
  length(x) == 1L | x != min(x)
})))

Dynamically growing an object in a loop (as you do with `res`) is not really recommended practice — talat, Sep 06 '16 at 13:02
@docendodiscimus You're right, vectorized solutions are better than loops, though loops are common and usually non-problematic except with huge data. I added a vectorized solution. Cheers. — Hack-R, Sep 06 '16 at 13:08
Hack-R, just fyi, the downvote is not from me though I see a potential reason for it (as explained). You can also use base R for this subset without dynamically growing an object. Here's an example: `subset(data, as.logical(ave(as.numeric(EndDate), CustID, FUN = function(x) length(x) == 1L | x != min(x))))` — talat, Sep 06 '16 at 13:15
@docendodiscimus Thanks. I went to add your suggested solution but I ran it and it produced a list of NULLs? — Hack-R, Sep 06 '16 at 13:19
@Hack-R, you need to adjust "data" to your data.frame object. And btw, using data.table, you could do `data[data[, .I[.N == 1L | EndDate != min(EndDate)]]$V1]` (after `setDT(data)`) — talat, Sep 06 '16 at 13:20

Jonno Bourne · Answer 2 · 2016-09-06T14:02:28.237

0

Here is a solution using the dplyr package

data <- list(CustID=c(seq(1,10,1),seq(1,5,1),3,3,3), 
         EndDate=c(Sys.Date(),
                   rep(seq(Sys.Date(),Sys.Date()+6, 1),2),
                   Sys.Date()+6, Sys.Date()+6, Sys.Date()+10
                   ))

#Convert list to data frame and remove oldest duplicates
data %>% 
  do.call(cbind.data.frame,.) %>%
  group_by(CustID) %>% 
  summarise_all(funs(last)) %>% 
  ungroup

ID "3" appears 5 times including a duplicate on the day that is going to be kept. The function "summarise" reduces the grouping variable (in this case ClustID) to a single observation depending on the arguments passed. In this case the "last" function keeps the most recent observation ignoring the duplicate on the same day.

If your data is not in chronological order use arrange(desc(EndDate)) to order the data.

edited Sep 06 '16 at 14:02

answered Sep 06 '16 at 13:33

Jonno Bourne

1,931
1
22
45

If you have multiple columns and don't want to summarise all, then use summarise(EndDate=last(EndDate)) – Jonno Bourne Sep 06 '16 at 13:34
1

extracting the last date is different than removing the first when there can be more than 2 entries per ID. (and you could just do `data %>% data.frame() %>% ...`) – talat Sep 06 '16 at 13:35
Do you mean I should arrange them in chronological order? "arrange(EndDate)"? Then the data could have multiple entries by customer ID and also the initial ordering would be irrelevant. – Jonno Bourne Sep 06 '16 at 13:39
No. Imagine ID 3 has 5 date entries. Now, if you remove the earliest/oldest date of that ID, it will have 4 remaining entries. In contrast, if you only keep the latest date, you will always end up with only 1 entry (row) per ID. – talat Sep 06 '16 at 13:41
Unless I have misunderstood what you are saying, that isn't how summarise works see the edit for clarification. – Jonno Bourne Sep 06 '16 at 14:02

Remove duplicate entries list with condition in R

2 Answers2