6

I'm using rmongodb to get every document in a particular collection. It works, but I'm working with millions of small documents, potentially 100M or more. I'm using the method suggested by the author on the website: cnub.org/rmongodb.ashx

count <- mongo.count(mongo, ns, query)
cursor <- mongo.find(mongo, ns, query)

# pre-allocate the result vectors, then fill them one document at a time
name <- vector("character", count)
age <- vector("numeric", count)
i <- 1
while (mongo.cursor.next(cursor)) {
    b <- mongo.cursor.value(cursor)
    name[i] <- mongo.bson.value(b, "name")
    age[i] <- mongo.bson.value(b, "age")
    i <- i + 1
}
df <- as.data.frame(list(name=name, age=age))

This works fine for hundreds or thousands of results, but that while loop is VERY, VERY slow. Is there some way to speed this up? Maybe an opportunity for multiprocessing? Any suggestions would be appreciated. I'm averaging 1M documents per hour, and at this rate I'll need a week just to build the data frame.
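On the multiprocessing question, rmongodb will not parallelise a single cursor for you, but one rough option is to split the collection into ranges on an indexed numeric field and walk one cursor per range in parallel. The sketch below is hypothetical, not the package's own recipe: the split field age, the breakpoints, and the default localhost connection are placeholders, and forking via the parallel package requires a Unix-like system.

library(rmongodb)
library(parallel)

# ns is the same namespace used above; the breakpoints are placeholders
breaks <- seq(0, 120, by = 30)

fetch_chunk <- function(lo, hi) {
  m <- mongo.create()                       # each worker gets its own connection
  on.exit(mongo.destroy(m))
  q <- mongo.bson.from.list(list(age = list("$gte" = lo, "$lt" = hi)))
  n <- mongo.count(m, ns, q)
  name <- character(n)                      # pre-allocate per chunk
  age  <- numeric(n)
  cur <- mongo.find(m, ns, q)
  i <- 1L
  while (mongo.cursor.next(cur)) {
    b <- mongo.cursor.value(cur)
    name[i] <- mongo.bson.value(b, "name")
    age[i]  <- mongo.bson.value(b, "age")
    i <- i + 1L
  }
  data.frame(name = name, age = age, stringsAsFactors = FALSE)
}

# one cursor per range, each walked on its own core
chunks <- mcmapply(fetch_chunk, head(breaks, -1), tail(breaks, -1),
                   SIMPLIFY = FALSE, mc.cores = 4)
df <- do.call(rbind, chunks)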

EDIT: I've noticed that the more vectors I fill inside the while loop, the slower it gets. I'm now trying a separate loop for each vector. It still seems like a hack, though; there must be a better way.

EDIT 2: I'm having some luck with data.table. It's still running, but it looks like it will finish the 12M documents in my current test set in about 4 hours. That's progress, but far from ideal.

dt <- data.table(uri=rep("NA",count),
                 time=rep(0,count),
                 action=rep("NA",count),
                 bytes=rep(0,count),
                 dur=rep(0,count))

i <- 1L
while (mongo.cursor.next(cursor)) {
  b <- mongo.cursor.value(cursor)
  # set() assigns by reference, so no copy of dt is made on each iteration
  set(dt, i, 1L,  mongo.bson.value(b, "cache"))
  set(dt, i, 2L,  mongo.bson.value(b, "path"))
  set(dt, i, 3L,  mongo.bson.value(b, "time"))
  set(dt, i, 4L,  mongo.bson.value(b, "bytes"))
  set(dt, i, 5L,  mongo.bson.value(b, "elaps"))
  i <- i + 1L
}

rjb101
  • I am no R programmer, in fact I have never used it, but why don't you pick out the subsets of data you need instead of iterating over the whole collection and then performing the required validation? In this case it would easily be faster to send, say, 6 cursors server-side instead of just one. – Sammaye Dec 20 '12 at 08:15
  • Huh? Of course it gets slower the more vectors there are in the while loop: there's more to do, so it takes longer. Or is it non-linear? How does it behave as you vary the number of things you loop over? And by 'more vectors' do you mean more fields like age and name? It's not clear. – Spacedman Dec 20 '12 at 08:47
  • @Sammaye, that's exactly what I meant by looping separately for each vector. I tried that last night: I put a counter in the loop and it appears to have just died; it stopped printing after several hours and the R session is hung. So that method didn't help. – rjb101 Dec 20 '12 at 14:28
  • @Spacedman, it's not 'of course'. It's just assigning values to the vector; it should not slow down more than linearly. To answer your question, age and name ARE the vectors, so more vectors means more fields like age and name. With just one vector, the loop finished in 30 minutes. There is no computation going on, just assignment of values. – rjb101 Dec 20 '12 at 14:33
  • Also, this is high-velocity time-series data; I can't subset, I need all of it. – rjb101 Dec 20 '12 at 14:34
  • You should be able to subset for test purposes. Another thought: it shouldn't take much effort to do this simple loop in Python or another language; that might tell you whether it's R or your MongoDB performance that is slow. – Spacedman Dec 20 '12 at 15:34
  • I can subset and it's much quicker; I'm doing that right now to develop the plots, but for this to work I need ALL the data. I'm having some luck with data.table; I'll post that below. – rjb101 Dec 20 '12 at 20:50
  • You may consider writing a query for MongoDB that outputs the entire result set to CSV, and then loading that CSV into R in a single call. That is, use a single write/read file-I/O design, as opposed to a while loop in R that talks to MongoDB directly. Or, if you can vectorize the while loop (i.e., remove the while call), you may be able to achieve a fast call directly from R. I'd be interested if you manage to achieve this within R. – Clayton Stanley Dec 20 '12 at 23:44
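A minimal sketch of that export-then-bulk-load idea, assuming mongoexport is on the PATH and using the field names from EDIT 2; the database/collection names and the output path are placeholders, and the flag spelling varies by MongoDB version (older releases use --csv rather than --type=csv).

library(data.table)

# dump the collection to CSV once, outside of the R while loop
system(paste("mongoexport --db mydb --collection mycollection",
             "--type=csv --fields cache,path,time,bytes,elaps",
             "--out /tmp/mycollection.csv"))

# one bulk read instead of millions of per-document assignments
dt <- fread("/tmp/mycollection.csv")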

2 Answers

3

You might want to try the mongo.find.exhaust option

cursor <- mongo.find(mongo, ns, query, options=mongo.find.exhaust)

This would be the easiest fix, if it actually works for your use case.

However, the rmongodb driver seems to be missing some features that are available in other drivers. For example, the JavaScript driver has a Cursor.toArray method, which dumps all of the find results directly into an array. The R driver has a mongo.bson.to.list function, but a mongo.cursor.to.list is probably what you really want. It's probably worth pinging the driver developer for advice.
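As a rough stand-in for the missing mongo.cursor.to.list (a sketch only, not part of the driver's documented API), each document can be converted with mongo.bson.to.list and collected into a pre-allocated list, with a single bind at the end instead of per-field assignment:

library(rmongodb)
library(data.table)

cursor.to.datatable <- function(cursor, n) {
  docs <- vector("list", n)       # pre-allocate; n comes from mongo.count as above
  i <- 1L
  while (mongo.cursor.next(cursor)) {
    doc <- mongo.bson.to.list(mongo.cursor.value(cursor))
    doc[["_id"]] <- NULL          # drop the ObjectId, keep only plain fields
    docs[[i]] <- doc
    i <- i + 1L
  }
  rbindlist(docs, fill = TRUE)    # one bind instead of millions of assignments
}

dt <- cursor.to.datatable(mongo.find(mongo, ns, query), count)

This still walks the cursor in R, so it mainly trades several mongo.bson.value calls per document for one mongo.bson.to.list call; whether that helps depends on how much time is spent in per-field extraction.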

A hacky solution would be to create a new collection whose documents are "chunks" of 100,000 of the original documents each. Each of these could then be read efficiently with mongo.bson.to.list. The chunked collection could be built with the MongoDB server's MapReduce functionality.

mjhm
  • I can't find any explanation on how mongo.find.exhaust would improve the speed. Do you know how it actually works? – pam Jan 15 '13 at 09:51
  • My limited understanding is that it forces the retrieval of all the query matches at once. It could improve speed if the overhead of repeated calls from the cursor.next to the database is significant. I would give it only a 3% chance of actually helping in this use case, but it's a simple change so worth a try. My best reference to it is http://mongodb.github.com/node-mongodb-native/api-generated/collection.html#find – mjhm Jan 15 '13 at 16:27
1

I know of no faster way to do this in a general manner. You are importing data from a foreign application, working in an interpreted language, and there's no way rmongodb can anticipate the structure of the documents in the collection. The process is inherently slow when you are dealing with thousands of documents.

  • Thanks Gerald. The docs are kind of light on mongo.find.exhaust; can you elaborate? I added this option and R crashed. – rjb101 Dec 21 '12 at 06:54
  • The problem is the append each time around with i = i + 1. I believe R is copying the data structure every time and then replacing it, so the larger it gets, the worse it gets. It's got nothing to do with it being an interpreted language; in Python this is orders of magnitude faster. – Thomas Browne Sep 15 '13 at 15:57
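Whether or not that is what is biting here, the copy and dispatch overhead of modifying a data.frame inside a loop is easy to see in isolation. A small, hypothetical illustration (sizes are arbitrary): row-by-row data.frame assignment goes through [<-.data.frame on every iteration, while filling a pre-allocated plain vector in place and building the frame once at the end does not.

n <- 1e4

df <- data.frame(x = numeric(n))
system.time(for (i in seq_len(n)) df[i, "x"] <- i)   # row-by-row data.frame assignment

x <- numeric(n)
system.time({
  for (i in seq_len(n)) x[i] <- i                    # fill a plain vector in place
  df2 <- data.frame(x = x)                           # build the data.frame once
})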