I have a large data.frame (50M rows) and need to find indices which match a multiple column criteria.
I made a toy example to illustrate the problem.
I read here that it's faster to search though data.tables than data.frames so I converted data.frame into a data.table object
cases<-c(1,3,5)
women<-c("Julia", "Judith", "Juno", "Jane", "Joanna")
data.df<-data.frame("id" = 1:5, "age" = c(20, 30, 40, 50, 60), "name" = c("Joanna","Joe", "Julia", "Juno", "John"))
library(data.table)
data.dt<-as.data.table(data.df)
setkey(data.dt, "id")
I want the results vector to contain age values of records which pass the multiple column criteria, in this case 20, 40, NA. I search using the for loop (which is probably a silly approach, any tips would be greatly appreciated here)
results<-vector()
for (i in 1:length(cases)){
which_id<-cases[i]
ind<-data.dt[id==which_id & name %in% women, which=TRUE]
if(length(ind)==0){results[i]<-NA}
else{results[i]<-data.dt$age[ind]}
}
This would do on a smaller data set but with 500K records in cases and 50M in data.df it is going to take an excess of 12 hours to run. There must be a simpler way, can anyone give a hint?