0

I have a large data frame with the following fields (example data).

#dput(data) gives...

data <- structure(list(idNum = 1:11, personID = c(111L, 112L, 113L, 113L, 111L, 112L, 114L, 112L, 111L, 113L, 115L), Name = c("PETER PAN", "RUPERT BEAR", "LONG JOHN SILVER", "SILVER LONG JOHN", "PAN PETER", "BEAR RUPERT", "R BEAR", "RUPERT BEAR", "PETER PAN", "LONG J SILVER", "LJ SILVER "), DOB = c("1/01/2001", "2/01/2001", "3/01/2001", "3/01/2001", "1/01/2001", "2/01/2001", "10/01/2001", "2/01/2001", "1/01/2001", "1/01/2001", "5/01/2001"), date = c("12/01/2012", "12/01/2012", "14/01/2012", "12/01/2012", "14/01/2012", "11/01/2012", "10/01/2012", "16/01/2012", "10/01/2012", "16/01/2012", "10/01/2012" ), colour = c("RED", "BLUE", "RED", "GREEN", "YELLOW", "BLUE", "RED", "BLUE", "ORGANGE", "BLUE", "ORANGE"), firstName = c("PETER", "RUPERT", "LONG", "SILVER", "PAN", "BEAR", "R", "RUPERT", "PETER", "LONG", "LJ"), lastName = c("PAN", "BEAR", "SILVER", "JOHN", "PETER", "RUPERT", "BEAR", "BEAR", "PAN", "SILVER", "SILVER")), .Names = c("idNum", "personID", "Name", "DOB", "date", "colour", "firstName", "lastName" ), row.names = c(NA, -11L), class = "data.frame")

The firstName and lastName are not in the original data. The name format in the original data set are generated with a free format entry system. It contains a large number of foreign names so data entry clerks do not accurately collect first name and last name. I derived them using:

data$firstName <-sapply(strsplit(data$Name, split=" "), head, 1)
data$lastName <- sapply(strsplit(data$Name, split=" "), tail, 1)

What I need to achieve is a subset data frame that removes duplicates matched on personID, Name and DOB such that the value returned contains the most entries with the most recent date for each unique case.

That is, I would like to return rows 5, 7, 8, 10 and 11.

I separated first name and last name because I envisaged that it would work by initially extracting cases where lastName == firstName then ording by date. I then tough that I could use case where lastName was in firstName and other consitions were met.

None if this worked and now I am lost.

Is there a relatively simple way to remove duplicates matched on columns personID, Name and DOB retaining the most recent unique cases?

Many thanks in advance.

John
  • 41,131
  • 31
  • 82
  • 106
  • 1
    See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example on how to make a reproducible example from your data. Make sure people can just copy-paste some code to reconstruct that data frame. – Joris Meys Jan 17 '12 at 10:16
  • Oops @Joris Meys, fixed with dput(). – John Jan 17 '12 at 10:50
  • 1
    There are two very different problems. The hard one is identifying when two person names are the same: you may have misspellings, abbreviations, and arbitrary first/last/middle name orders. The easy one is taking the last value for each person: once the personID column has been fixed, just sort the data along the date column, and the last values are: `data[ !duplicated( data$personID, fromLast=TRUE ), ]`. – Vincent Zoonekynd Jan 17 '12 at 11:45
  • 1
    This returns the last entry with the same personID, not necessarily the most recent in terms of `date`; maybe not an issue here but better be aware of. – vaettchen Jan 17 '12 at 12:11
  • thanks @VincentZoonekynd and @vaettchen. I used `data <- ddply(.data=data, .variables= 'date')` to sort the date and selected the most recent cases that I needed (for this test). I will try on real data tomorrow. The names issue is much harder. I think the best approach will be separate to two data.frames, unique and duplicate entries, then explore the problem further. – John Jan 17 '12 at 12:32

1 Answers1

1

I used @Vincent's

data[ !duplicated( data$personID, fromLast=TRUE ), ]

after had sorted by:

data <- ddply(.data=data, .variables= 'date')

John
  • 41,131
  • 31
  • 82
  • 106