This may look like an innocuously simple problem, but it takes a very long time to execute. Any ideas for speeding it up, vectorizing it, etc. would be greatly appreciated.
I have an R data frame with 5 million rows and 50 columns: OriginalDataFrame
A list of indices from that frame: IndexList (55,000 [numIndex] unique indices)
It's a time series, so there are ~5 million rows for 55K unique indices.
The OriginalDataFrame has been ordered by dataIndex. Not all of the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present and construct a new data frame: FinalDataFrame
Currently I am running this code using library(foreach):
FinalDataFrame <- foreach(i = 1:numIndex, .combine = "rbind") %dopar% {
  OriginalDataFrame[OriginalDataFrame$dataIndex == IndexList[i], ]
}
I run this on a machine with 24 cores and 128 GB of RAM, and yet it takes around 6 hours to complete.
Am I doing something exceedingly silly or are there better ways in R to do this?
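For anyone who wants to reproduce this at small scale, here is a minimal sketch on toy data. The sizes, the `value` column, and the names `pieces`, `LoopResult` and `PresentIndices` are made up for illustration; only `OriginalDataFrame`, `IndexList`, `numIndex`, `dataIndex` and `FinalDataFrame` come from the description above. It runs a sequential version of the loop (one full scan of the column per index) next to a single `%in%` membership test, which is one possible vectorized formulation of the same task. This is a sketch under those assumptions, not a benchmark of the real data.

```r
# Toy stand-in for the real data; sizes and the `value` column are made up.
set.seed(1)
OriginalDataFrame <- data.frame(
  dataIndex = rep(c(2L, 3L, 5L, 8L), each = 3),  # already ordered by dataIndex
  value     = rnorm(12)
)
IndexList <- c(1L, 2L, 5L, 9L)   # 1 and 9 do not occur in the data
numIndex  <- length(IndexList)

# Sequential version of the loop above: one full scan of the column per index.
pieces <- lapply(seq_len(numIndex), function(i) {
  OriginalDataFrame[OriginalDataFrame$dataIndex == IndexList[i], ]
})
LoopResult <- do.call(rbind, pieces)

# One vectorized membership test over the whole column.
FinalDataFrame <- OriginalDataFrame[OriginalDataFrame$dataIndex %in% IndexList, ]

# Indices from IndexList that actually occur in the data.
PresentIndices <- intersect(IndexList, OriginalDataFrame$dataIndex)
```

On this toy data both approaches select the same rows in the same order, but only because `IndexList` happens to be sorted the same way as `dataIndex`. With an unsorted `IndexList`, the loop returns rows in `IndexList` order while the `%in%` subset keeps the order of `OriginalDataFrame`.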