
I have a large dataset, over 1.5 million rows from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where one of the subjects has a DOB entered incorrectly.

test <- data.frame(
    ID = c(rep(1, 3), rep(2, 4), rep(3, 2)),
    DOB = c(rep("2000-03-01", 3), "2000-05-06", "2002-05-06",
            "2000-05-06", "2000-05-06", "2004-04-06", "2004-04-06")
)

> test
  ID        DOB
1  1 2000-03-01
2  1 2000-03-01
3  1 2000-03-01
4  2 2000-05-06
5  2 2002-05-06
6  2 2000-05-06
7  2 2000-05-06
8  3 2004-04-06
9  3 2004-04-06

What I am after is some code to identify that '2' has an error. I can think of some roundabout ways using a for loop, but that would be computationally inefficient.

Thanks

nzcoops

4 Answers


Using base functions, the fastest solution would be something like:

> x <- unique(test[c("ID","DOB")])
> x$ID[duplicated(x$ID)]
[1] 2
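For intuition, the intermediate `x` keeps just one row per distinct (ID, DOB) pair, so any ID that still appears more than once must have conflicting DOBs. On the example data it looks like this:

> x
  ID        DOB
1  1 2000-03-01
4  2 2000-05-06
5  2 2002-05-06
8  3 2004-04-06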

Timing:

n <- 1000
system.time(replicate(n,{
  x <- unique(test[c("ID","DOB")])
  x$ID[duplicated(x$ID)]
}))
   user  system elapsed 
   0.70    0.00    0.71 

system.time(replicate(n,{
  DOBError(test)   # Greg's split/lapply function, defined in his answer below
}))
   user  system elapsed 
   1.69    0.00    1.69 

system.time(replicate(n,{
  zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
  zzz[zzz$DOB > 1, ]
}))
   user  system elapsed 
   4.23    0.02    4.27 

system.time(replicate(n,{
  zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
  zz[zz$dups > 1, ]
}))
   user  system elapsed 
   6.63    0.01    6.64 
Joris Meys
  • Interesting. Thanks Joris. I have to wait 8 hours to post an answer but I too did some speed testing on 100k rows from my actual data. Oddly I had the aggregate function running substantially quicker than the DOBError function. – nzcoops Apr 21 '11 at 10:41
  • @nzcoops : It depends on the size of the dataframe as well. Sometimes code can be faster when replicated 1000 times on small data, but slower when done on one dataframe that's 1000 times larger. – Joris Meys Apr 21 '11 at 12:09
  • ahhhh `ddply()` is such a pig in comparison! I guess that's the price you pay for convenience (sometimes). I wonder what effect flipping the `.parallel` switch may have. +1 for the time comparisons. – Chase Apr 21 '11 at 12:11

One approach using plyr:

library(plyr)
zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
zz[zz$dups > 1, ]

And if base R is your thing, using `aggregate()`:

zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
zzz[zzz$DOB > 1, ]
Chase
  • Nice thanks! Another new function I didn't know about. Out of curiosity, do you know of a way to get that to process a vector the same length as nrow(test)? As in 1 1 1 2 2 2 2 1 1? Just thinking about the application of this to other functions and where you might like to keep the data in that longitudinal format. – nzcoops Apr 21 '11 at 04:22
  • @nzcoops - in the case of the `plyr` solution, replace `summarise` with `transform`. It's not immediately clicking in my head how to adjust the `aggregate` solution, but I'll update this answer when I do. – Chase Apr 21 '11 at 04:29
  • Here is one `aggregate` possibility: `aggregate(test$DOB, by = list(test$ID), function(x)length(unique(x))>1)` – Greg Apr 21 '11 at 04:47
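Following up on nzcoops' question above about keeping a value on every row in the longitudinal format: a base R sketch (not from the original thread) using `ave()`, which computes the count per ID and recycles it back onto each row. `as.numeric()` is needed because `ave()` returns a vector of the same type as its input, which is character here.

# sketch: per-row count of distinct DOBs within each ID
test$dups <- as.numeric(ave(test$DOB, test$ID,
                            FUN = function(x) length(unique(x))))
test$dups
# [1] 1 1 1 2 2 2 2 1 1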

With such a large data volume I propose a different solution, based on comparing adjacent rows, which uses the power of vectorized operations in R:

# sort so rows with the same ID are adjacent, then flag each row whose ID
# matches the previous row but whose DOB differs
test <- test[order(test$ID), ]
n <- nrow(test)
ind <- test$ID[-1] == test$ID[-n] & test$DOB[-1] != test$DOB[-n]
unique(test$ID[c(FALSE, ind)])

For the test data the timing is similar to Joris's approach, but for large data:

# 1.8 million rows over 600k IDs, with 5000 wrong DOBs planted at random
test2 <- data.frame(
    ID = rep(1:600000, 3),
    DOB = "2000-01-01",
    stringsAsFactors = FALSE
)
test2$DOB[sample.int(nrow(test2), 5000)] <- "2000-01-02"

system.time(resA<-{
    x <- unique(test2[c("ID","DOB")])
    x$ID[duplicated(x$ID)]
})
#   user  system elapsed 
#   7.44    0.14    7.58 

system.time(resB <- {
    test2 <- test2[order(test2$ID), ]
    n <- nrow(test2)
    ind <- test2$ID[-1] == test2$ID[-n] & test2$DOB[-1] != test2$DOB[-n]
    unique(test2$ID[c(FALSE,ind)])
})
#   user  system elapsed 
#   0.76    0.04    0.81 

all.equal(sort(resA),sort(resB))
# [1] TRUE
Marek

Another base R option, using split() and lapply() wrapped in a function:

DOBError <- function(data){
    # one count of distinct DOBs per ID; names(count) are the IDs
    count <- unlist(lapply(split(data, data$ID),
        function(x) length(unique(x$DOB))))
    return(names(count)[count > 1])
}


DOBError(test)

[1] "2"
Greg
  • Thanks, hadn't seen unlist or split before, but length(unique()) was what I was thinking as well. – nzcoops Apr 21 '11 at 04:06
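As an aside (not from the original thread), the same split-and-count idea can be sketched more compactly with `tapply()`, which splits DOB by ID, applies the counting function, and returns a named vector keyed by ID:

# sketch: named vector of distinct-DOB counts, one entry per ID
counts <- tapply(test$DOB, test$ID, function(x) length(unique(x)))
names(counts)[counts > 1]
# [1] "2"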