R: removing duplicated entries if they come within a year

Question

I'm new to R. I have a data frame of 500000 entries with patient IDs, dates, and other variables.

I want to remove any repeated patient ID (PtID) that occurs within one year of its first appearance. For example:

    PtID  date
1.  1     01/01/2006
2.  2     01/01/2006
3.  1     24/02/2006
4.  4     26/03/2006
5.  1     04/05/2006
6.  1     05/05/2007

In this case I want to remove the 3rd and 5th rows and keep the 1st and 6th rows.

Can somebody help me with this, please? This is the `str` of my data, which is called final1:

str(final1)
'data.frame':   605870 obs. of  70 variables:
...
 $ Date          : Date, format: "2006-03-12" "2006-04-01" ...
 $ PtID          : int  11251 11251 11251 11251 11251 11251 11251 30938 30938 11245 ...
...

Can you update your question with either `str` of your data OR paste the results of `dput(head(yourData))`. Working with dates is a bit tricky and a good answer will need to know how the date column is stored. Other good advice on making reproducible examples can be found [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Chase, May 09 '12 at 13:03

score 2 · Accepted Answer · answered May 09 '12 at 13:43

Here's one solution that uses plyr and lubridate. First load the packages:

require(plyr)
require(lubridate)

Next create some sample data (notice that this is a bit more straightforward than your example!)

num = 1:6
PtID = c(1,2,1,4,1,1)
date = c("01/01/2006", "01/01/2006","24/02/2006", "26/03/2006", "04/05/2006",
  "05/05/2007")
dd = data.frame(PtID, date)

Now we make the date column an R date object:

dd$date = dmy(date)

and a function that contains the rule of whether a row should be included:

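keepId = function(dates) {
  # Days elapsed since this patient's first visit; explicit units make the
  # comparison work for both Date and POSIXct columns
  elapsed = as.numeric(difftime(dates, min(dates), units = "days"))
  # Keep the first visit and anything more than a year (365 days) after it
  keep = (elapsed > 365) | (dates == min(dates))
  return(keep)
}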
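
The comparison against one year depends on the units of the date difference: subtracting `Date` values gives days, while subtracting POSIXct date-times lets R choose the units automatically. A quick base R sketch shows the difference and how `difftime` with explicit `units` removes the ambiguity. The variable names `d`, `p`, `days_d` and `days_p` are made up for illustration.

```r
# Two visits by the same patient, stored once as Date and once as POSIXct
d <- as.Date(c("2006-01-01", "2007-05-05"))
p <- as.POSIXct(c("2006-01-01", "2007-05-05"), tz = "UTC")

# Date subtraction is always measured in days
units(d - min(d))

# POSIXct subtraction picks its units automatically, so inspect them
# before comparing against a plain number
units(p - min(p))

# Asking for days explicitly gives the same numbers for both classes
days_d <- as.numeric(difftime(d, min(d), units = "days"))
days_p <- as.numeric(difftime(p, min(p), units = "days"))
```

Since `str(final1)` in the question shows `Date` is stored as a Date, a threshold expressed in days is probably the safer choice for that data.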

All that remains is using ddply to partition the data frame by PtID:

dd_sub = ddply(dd, c("PtID"), transform, keep = keepId(date))
dd_sub[dd_sub$keep,]
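
For comparison, the same rule can be sketched in base R without plyr or lubridate. This is an alternative, not the approach used above: `ave` finds each patient's first visit, and a row is kept when it is that first visit or falls more than 365 days after it. The names `first_visit`, `elapsed` and `dd_kept` are made up for illustration.

```r
# Sample data from the question, with dates parsed as day/month/year
dd <- data.frame(
  PtID = c(1, 2, 1, 4, 1, 1),
  date = as.Date(c("01/01/2006", "01/01/2006", "24/02/2006",
                   "26/03/2006", "04/05/2006", "05/05/2007"),
                 format = "%d/%m/%Y")
)

# Day number of each patient's first visit, repeated on every row of that patient
first_visit <- ave(as.numeric(dd$date), dd$PtID, FUN = min)

# Days elapsed between each visit and that patient's first visit
elapsed <- as.numeric(dd$date) - first_visit

# Keep first visits and visits more than a year (365 days) after the first
dd_kept <- dd[elapsed == 0 | elapsed > 365, ]
dd_kept
```

On the sample data this keeps rows 1, 2, 4 and 6. With the data in the question, the same idea would presumably apply to `final1$Date` and `final1$PtID`, assuming those columns are as shown in `str(final1)`.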

thank you very much,, I will try it now.. sorry for the bad example. Im really new to the site and R :) — adil wahaibi, May 09 '12 at 14:06
Everyone takes a while to get the hang of it. Just read previous posts. — csgillespie, May 09 '12 at 14:08
