
I have a large dataset, over 1.5 million rows from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where one of the subjects has a DOB entered incorrectly.

test <- data.frame(
    ID = c(rep(1, 3), rep(2, 4), rep(3, 2)),
    DOB = c(rep("2000-03-01", 3), "2000-05-06", "2002-05-06",
            "2000-05-06", "2000-05-06", "2004-04-06", "2004-04-06")
)

> test
  ID        DOB
1  1 2000-03-01
2  1 2000-03-01
3  1 2000-03-01
4  2 2000-05-06
5  2 2002-05-06
6  2 2000-05-06
7  2 2000-05-06
8  3 2004-04-06
9  3 2004-04-06

What I am after is some code to identify that '2' has an error. I can think of some roundabout ways using a for loop, but that would be computationally inefficient.

Thanks

nzcoops

4 Answers


Using base functions, the fastest solution would be something like:

> x <- unique(test[c("ID","DOB")])
> x$ID[duplicated(x$ID)]
[1] 2
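For intuition, the intermediate `x` keeps just one row per distinct (ID, DOB) pair, so any ID that still appears more than once must have conflicting DOBs. On the example data it looks like this:

> x
  ID        DOB
1  1 2000-03-01
4  2 2000-05-06
5  2 2002-05-06
8  3 2004-04-06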

Timing:

n <- 1000
system.time(replicate(n,{
  x <- unique(test[c("ID","DOB")])
  x$ID[duplicated(x$ID)]
}))
   user  system elapsed 
   0.70    0.00    0.71 

system.time(replicate(n,{
  DOBError(test)   # Greg's split/lapply function, defined in his answer below
}))
   user  system elapsed 
   1.69    0.00    1.69 

system.time(replicate(n,{
  zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
  zzz[zzz$DOB > 1, ]
}))
   user  system elapsed 
   4.23    0.02    4.27 

system.time(replicate(n,{
  zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
  zz[zz$dups > 1, ]
}))
   user  system elapsed 
   6.63    0.01    6.64 
Joris Meys
  • Interesting. Thanks Joris. I have to wait 8 hours to post an answer but I too did some speed testing on 100k rows from my actual data. Oddly I had the aggregate function running substantially quicker than the DOBError function. – nzcoops Apr 21 '11 at 10:41
  • @nzcoops : It depends on the size of the dataframe as well. Sometimes code can be faster when replicated 1000 times on small data, but slower when done on one dataframe that's 1000 times larger. – Joris Meys Apr 21 '11 at 12:09
  • ahhhh `ddply()` is such a pig in comparison! I guess that's the price you pay for convenience (sometimes). I wonder what effect flipping the `.parallel` switch may have. +1 for the time comparisons. – Chase Apr 21 '11 at 12:11

One approach using plyr:

library(plyr)
zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
zz[zz$dups > 1, ]

And if base R is your thing, using `aggregate()`:

zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
zzz[zzz$DOB > 1, ]
Chase
  • Nice thanks! Another new function I didn't know about. Out of curiosity, do you know of a way to get that to process a vector the same length as nrow(test)? As in 1 1 1 2 2 2 2 1 1? Just thinking about the application of this to other functions and where you might like to keep the data in that longitudinal format. – nzcoops Apr 21 '11 at 04:22
  • @nzcoops - in the case of the `plyr` solution, replace `summarise` with `transform`. It's not immediately clicking in my head how to adjust the `aggregate` solution, but I'll update this answer when I do. – Chase Apr 21 '11 at 04:29
  • Here is one `aggregate` possibility: `aggregate(test$DOB, by = list(test$ID), function(x)length(unique(x))>1)` – Greg Apr 21 '11 at 04:47
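Following up on nzcoops' question above about keeping a value on every row in the longitudinal format: a base R sketch (not from the original thread) using `ave()`, which computes the count per ID and recycles it back onto each row. `as.numeric()` is needed because `ave()` returns a vector of the same type as its input, which is character here.

# sketch: per-row count of distinct DOBs within each ID
test$dups <- as.numeric(ave(test$DOB, test$ID,
                            FUN = function(x) length(unique(x))))
test$dups
# [1] 1 1 1 2 2 2 2 1 1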

With such a large data volume I propose a different solution, based on comparing adjacent rows, which uses the power of vectorized operations in R:

# sort so rows with the same ID are adjacent, then flag each row whose ID
# matches the previous row but whose DOB differs
test <- test[order(test$ID), ]
n <- nrow(test)
ind <- test$ID[-1] == test$ID[-n] & test$DOB[-1] != test$DOB[-n]
unique(test$ID[c(FALSE, ind)])

For the test data the timing is similar to Joris's approach, but for large data:

# 1.8 million rows over 600k IDs, with 5000 wrong DOBs planted at random
test2 <- data.frame(
    ID = rep(1:600000, 3),
    DOB = "2000-01-01",
    stringsAsFactors = FALSE
)
test2$DOB[sample.int(nrow(test2), 5000)] <- "2000-01-02"

system.time(resA<-{
    x <- unique(test2[c("ID","DOB")])
    x$ID[duplicated(x$ID)]
})
#   user  system elapsed 
#   7.44    0.14    7.58 

system.time(resB <- {
    test2 <- test2[order(test2$ID), ]
    n <- nrow(test2)
    ind <- test2$ID[-1] == test2$ID[-n] & test2$DOB[-1] != test2$DOB[-n]
    unique(test2$ID[c(FALSE,ind)])
})
#   user  system elapsed 
#   0.76    0.04    0.81 

all.equal(sort(resA),sort(resB))
# [1] TRUE
Marek

Another base R option, using split() and lapply() wrapped in a function:

DOBError <- function(data){
    # one count of distinct DOBs per ID; names(count) are the IDs
    count <- unlist(lapply(split(data, data$ID),
        function(x) length(unique(x$DOB))))
    return(names(count)[count > 1])
}


DOBError(test)

[1] "2"
Greg
  • Thanks, hadn't seen unlist or split before, but length(unique()) was what I was thinking as well. – nzcoops Apr 21 '11 at 04:06
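As an aside (not from the original thread), the same split-and-count idea can be sketched more compactly with `tapply()`, which splits DOB by ID, applies the counting function, and returns a named vector keyed by ID:

# sketch: named vector of distinct-DOB counts, one entry per ID
counts <- tapply(test$DOB, test$ID, function(x) length(unique(x)))
names(counts)[counts > 1]
# [1] "2"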