I have a large data frame with the following fields (example data).
#dput(data) gives...
data <- structure(list(idNum = 1:11, personID = c(111L, 112L, 113L, 113L,
111L, 112L, 114L, 112L, 111L, 113L, 115L), Name = c("PETER PAN",
"RUPERT BEAR", "LONG JOHN SILVER", "SILVER LONG JOHN", "PAN PETER",
"BEAR RUPERT", "R BEAR", "RUPERT BEAR", "PETER PAN", "LONG J SILVER",
"LJ SILVER "), DOB = c("1/01/2001", "2/01/2001", "3/01/2001",
"3/01/2001", "1/01/2001", "2/01/2001", "10/01/2001", "2/01/2001",
"1/01/2001", "1/01/2001", "5/01/2001"), date = c("12/01/2012",
"12/01/2012", "14/01/2012", "12/01/2012", "14/01/2012", "11/01/2012",
"10/01/2012", "16/01/2012", "10/01/2012", "16/01/2012", "10/01/2012"
), colour = c("RED", "BLUE", "RED", "GREEN", "YELLOW", "BLUE",
"RED", "BLUE", "ORGANGE", "BLUE", "ORANGE"), firstName = c("PETER",
"RUPERT", "LONG", "SILVER", "PAN", "BEAR", "R", "RUPERT", "PETER",
"LONG", "LJ"), lastName = c("PAN", "BEAR", "SILVER", "JOHN",
"PETER", "RUPERT", "BEAR", "BEAR", "PAN", "SILVER", "SILVER")), .Names = c("idNum",
"personID", "Name", "DOB", "date", "colour", "firstName", "lastName"
), row.names = c(NA, -11L), class = "data.frame")
The firstName and lastName are not in the original data. The name format in the original data set are generated with a free format entry system. It contains a large number of foreign names so data entry clerks do not accurately collect first name and last name. I derived them using:
data$firstName <-sapply(strsplit(data$Name, split=" "), head, 1)
data$lastName <- sapply(strsplit(data$Name, split=" "), tail, 1)
What I need to achieve is a subset data frame that removes duplicates matched on personID, Name and DOB such that the value returned contains the most entries with the most recent date for each unique case.
That is, I would like to return rows 5, 7, 8, 10 and 11.
I separated first name and last name because I envisaged that it would work by initially extracting cases where lastName == firstName
then ording by date. I then tough that I could use case where lastName was in firstName and other consitions were met.
None if this worked and now I am lost.
Is there a relatively simple way to remove duplicates matched on columns personID, Name and DOB retaining the most recent unique cases?
Many thanks in advance.