I'm using rmongodb to get every document in a a particular collection. It works but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website: cnub.org/rmongodb.ashx
count <- mongo.count(mongo, ns, query)
cursor <- mongo.find(mongo, query)
name <- vector("character", count)
age <- vector("numeric", count)
i <- 1
while (mongo.cursor.next(cursor)) {
b <- mongo.cursor.value(cursor)
name[i] <- mongo.bson.value(b, "name")
age[i] <- mongo.bson.value(b, "age")
i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))
This works fine for hundreds or thousands of results but that while loop is VERY VERY slow. Is there some way to speed this up? Maybe an opportunity for multiprocessing? Any suggestions would be appreciated. I'm averaging 1M per hour and at this rate I'll need a week just to build the data frame.
EDIT: I've noticed that the more vectors in the while loop the slower it gets. I'm now trying to loop separately for each vector. Still seems like a hack though, there must be a better way.
Edit 2: I'm having some luck with data.table. Its still running but it looks like it will finish the 12M (this is my current test set) in 4 hours, that's progress but far from ideal
dt <- data.table(uri=rep("NA",count),
time=rep(0,count),
action=rep("NA",count),
bytes=rep(0,count),
dur=rep(0,count))
while (mongo.cursor.next(cursor)) {
b <- mongo.cursor.value(cursor)
set(dt, i, 1L, mongo.bson.value(b, "cache"))
set(dt, i, 2L, mongo.bson.value(b, "path"))
set(dt, i, 3L, mongo.bson.value(b, "time"))
set(dt, i, 4L, mongo.bson.value(b, "bytes"))
set(dt, i, 5L, mongo.bson.value(b, "elaps"))
}