
I've got some data involving repeated sales for a bunch of cars with unique Ids. A car can be sold more than once.

Some of the Ids are erroneous, however, so for each Id I'm checking whether the size is recorded the same across all of its sales. If it isn't, I know that the Id is erroneous.

I'm trying to do this with the following code:

library("doMC")

Data <- data.frame(ID=c(15432,67325,34623,15432,67325,34623),Size=c("Big","Med","Small","Big","Med","Big"))
compare <- function(v) all(sapply( as.list(v[-1]), FUN=function(z) {isTRUE(all.equal(z, v[1]))}))

IsGoodId = function(Id){
  Sub = Data[Data$ID==Id,]
  if (length(Sub[,1]) > 1){
    return(compare(Sub[,"Size"]))
  }else{
    return(TRUE)
  }
}

WhichAreGood = mclapply(unique(Data$ID),IsGoodId)

But it's painfully, awfully, terribly slow on my quad-core i5.

Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks, -N

N. McA.
    Hi there! Please make your post reproducible by having a look at [**How to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for us to help you. Thank you. – Arun Mar 28 '13 at 19:42
  • Checking for invalid entries might be more efficient, but I would not think that checking for duplicates should be made more efficient with parallel processing. – IRTFM Mar 28 '13 at 19:48
  • I suggest you find "R inferno" by Patrick Burns. It's an eye opener. – Roman Luštrik Mar 28 '13 at 21:10

1 Answer


Looks like your algorithm makes N^2 comparisons: for each unique Id you subset the whole data frame, so the work grows quadratically with the number of sales. Maybe something like the following will scale better. We find the duplicated sales, on the assumption that they are a small subset of the total.

dups = unique(Data$ID[duplicated(Data$ID)])
DupData = Data[Data$ID %in% dups,,drop=FALSE]

The %in% operator scales very well. Then split the Size column by Id, checking for Ids with more than one size:

tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)

This gives a named logical vector, with TRUE indicating that there is more than one size for that Id. This scales approximately linearly with the number of duplicated sales; there are clever ways to make it go faster still if your duplicated data is itself big.
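
For instance, with the example Data from the question, this should give something like the following (expected output shown as comments):

dups = unique(Data$ID[duplicated(Data$ID)])
DupData = Data[Data$ID %in% dups,,drop=FALSE]
tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
## 15432 34623 67325
## FALSE  TRUE FALSE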

Hmm, thinking about this a bit more, I guess

u = unique(Data)
u$ID[duplicated(u$ID)]

does the trick: unique(Data) drops rows that are identical in both ID and Size, so any ID that still appears more than once must have been recorded with conflicting sizes.
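
On the example data this should return just the inconsistent Id (expected result shown as a comment):

u = unique(Data)
u$ID[duplicated(u$ID)]
## [1] 34623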

Martin Morgan
  • Ahah, you're a hero. I was starting to realise it had quadratic complexity, but couldn't muddle through. Thank you! – N. McA. Mar 28 '13 at 20:10