aggregate is already quite fast, but you can use an sapply solution to parallelize the code. On Windows this is easily done using snowfall:
require(snowfall)
sfInit(parallel = TRUE, cpus = 2)
sfExport("Data")   # copy Data to every worker once
ID <- unique(Data$ID)
CombNames <- sfSapply(ID, function(i) {
  paste(Data$Names[Data$ID == i], collapse = " | ")
})
data.frame(ID, CombNames)
sfStop()
The parallel version gives you an extra speedup, but the single sapply solution is actually slower than aggregate. tapply is a bit faster, but it can't be parallelized using snowfall. On my computer:
n <- 3000
m <- 3
Data <- data.frame(ID = rep(1:n, m),
                   Names = rep(LETTERS[1:m], each = n))
# using snowfall for a parallel sapply
system.time({
  ID <- unique(Data$ID)
  CombNames <- sfSapply(ID, function(i) {
    paste(Data$Names[Data$ID == i], collapse = " | ")
  })
  data.frame(ID, CombNames)
})
   user  system elapsed
   0.02    0.00    0.33
# using tapply
system.time({
  CombNames <- tapply(Data$Names, Data$ID, paste, collapse = " | ")
  data.frame(ID = names(CombNames), CombNames)
})
   user  system elapsed
   0.44    0.00    0.44
# using aggregate
system.time(
  aggregate(Names ~ ID, data = Data, FUN = paste, collapse = " | ")
)
   user  system elapsed
   0.47    0.00    0.47
# using the normal sapply
system.time({
  ID <- unique(Data$ID)
  CombNames <- sapply(ID, function(i) {
    paste(Data$Names[Data$ID == i], collapse = " | ")
  })
  data.frame(ID, CombNames)
})
   user  system elapsed
   0.75    0.00    0.75
Note:
For the record, the better sapply solution would be:
CombNames <- sapply(split(Data$Names, Data$ID), paste, collapse = " | ")
data.frame(ID = names(CombNames), CombNames)
which is equivalent to tapply. But parallelizing this one is actually slower, as you have to move more data around within the sfSapply call. The speed of the parallel solution above comes from copying the dataset to every CPU once, up front. This is what you have to keep in mind when your dataset is huge: you pay for the speed with more memory usage.
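To make the data-movement point concrete, here is a minimal sketch (assuming Data is defined as above) of what parallelizing the split-based version would look like. Note that the whole split list, i.e. all the names, has to be serialized and shipped to the workers through sfSapply itself, instead of just the small ID vector as in the sfExport approach:

require(snowfall)
sfInit(parallel = TRUE, cpus = 2)
# each element of Chunks is a vector of names for one ID;
# this entire list travels to the workers via sfSapply
Chunks <- split(Data$Names, Data$ID)
CombNames <- sfSapply(Chunks, paste, collapse = " | ")
data.frame(ID = names(CombNames), CombNames)
sfStop()

That extra communication is exactly why this variant loses to the sfExport version, despite doing less work per task.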