I tried to import data from MongoDB into R using:

mongo.find.all(mongo, namespace, query = query,
               fields = list('_id' = 0, 'entityEventName' = 1,
                             'context' = 1, 'startTime' = 1),
               data.frame = T)

The command works fine for small data sets, but I want to import 1,000,000 documents.

Using system.time and adding limit = X to the command, I measured the time as a function of the number of documents imported:

system.time(mongo.find.all(mongo, namespace, query = query,
                           fields = list('_id' = 0, 'entityEventName' = 1,
                                         'context' = 1, 'startTime' = 1),
                           limit = 10000, data.frame = T))

The results:

Data Size   Time (s)
1           0.02
100         0.29
1000        2.51
5000        16.47
10000       20.41
50000       193.36
100000      743.74
200000      2828.33

After plotting the data, I believe the import time grows quadratically with the data size: Time = f(DataSize^2)

Time = -138.3643 + 0.0067807 * DataSize + 6.773e-8 * (DataSize - 45762.6)^2

R^2 = 0.999997
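
For reference, the quadratic fit can be reproduced in R itself with lm. A sketch using the timings above (my fit was reported in a centered parameterization, so lm's raw coefficients differ, but it spans the same degree-2 model, so the fitted curve and R^2 are identical):

# Fit Time as a degree-2 polynomial of Data Size
timings <- data.frame(
  size = c(1, 100, 1000, 5000, 10000, 50000, 100000, 200000),
  time = c(0.02, 0.29, 2.51, 16.47, 20.41, 193.36, 743.74, 2828.33)
)
fit <- lm(time ~ size + I(size^2), data = timings)
summary(fit)$r.squared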

  1. Am I correct?
  2. Is there a faster command?

Thanks!

1 Answer


`lm` is cool, but I think if you try adding power 3, 4, 5, ... features, you'll also get a great R^2 =) You're overfitting =)
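
To see the point concretely, here's a sketch comparing R^2 across polynomial degrees (it assumes the `timings` data frame from the sketch in the question; with only 8 points, higher degrees fit ever more closely without saying anything about the true complexity):

# Each extra polynomial degree nudges R^2 closer to 1 on just 8 points
for (d in 2:5) {
  fit_d <- lm(time ~ poly(size, d), data = timings)
  cat('degree', d, 'R^2 =', summary(fit_d)$r.squared, '\n')
}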

One of R's known drawbacks is that you can't efficiently append elements to a vector (or list): appending an element triggers a copy of the entire object. Here you can see a consequence of that effect. In general, when you fetch data from MongoDB, you don't know the size of the result in advance, so you iterate through a cursor and grow the resulting list. In older versions of rmongodb this procedure was incredibly slow because of the R behaviour described above; after this pull request, performance became much better. The trick with environments helps a lot, but it is still not as fast as a preallocated list.
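
Here is a minimal sketch (my illustration, not code from rmongodb) of the three strategies. On older R versions the copying makes the naive growth dramatically slower; recent R amortizes list growth, so the gap is smaller there:

n <- 1e5

# Grow a list one element at a time, as a naive cursor loop would;
# on older R each assignment past the end copied the whole list
grow <- function(n) {
  out <- list()
  for (i in seq_len(n)) out[[i]] <- i
  out
}

# Fill a list whose full length was allocated up front
prealloc <- function(n) {
  out <- vector('list', n)
  for (i in seq_len(n)) out[[i]] <- i
  out
}

# The environment trick: environments are mutable hash maps,
# so insertion never copies the existing elements
env_grow <- function(n) {
  e <- new.env()
  for (i in seq_len(n)) e[[as.character(i)]] <- i
  e
}

system.time(grow(n))
system.time(prealloc(n))
system.time(env_grow(n))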

But can we potentially do better? Yes.

1) Simply allow the user to specify the size of the result and preallocate the list, and do this automatically when limit= is passed to mongo.find.all. I filed an issue for this enhancement.
2) Construct the result in C code.

If you know the size of your data in advance, you can do:

# NUMBER_OF_RECORDS is assumed known in advance (see the sketch below)
cursor <- mongo.find(mongo, namespace, query = query,
                     fields = list('_id' = 0, 'entityEventName' = 1,
                                   'context' = 1, 'startTime' = 1))
# Preallocate the full result list so the loop never grows it
result_lst <- vector('list', NUMBER_OF_RECORDS)
i <- 1
# Iterate through the cursor, converting each BSON document to an R list
while (mongo.cursor.next(cursor)) {
  result_lst[[i]] <- mongo.bson.to.list(mongo.cursor.value(cursor))
  i <- i + 1
}
# Bind the per-document lists into a single data.table
result_dt <- data.table::rbindlist(result_lst)
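
If you plan to fetch the whole collection (see the comments below), the preallocation size is easy to obtain up front. A sketch, assuming the same `mongo` connection and `namespace`:

# Count the matching documents first, so the result list can be
# preallocated to exactly the right size
NUMBER_OF_RECORDS <- mongo.count(mongo, namespace, query)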
Dmitriy Selivanov
  • Thx Dmitriy! In most cases I want to extract all the docs in my collection, so I know the size of the result (using `mongo.count`). But I don't understand what `cursor <- mongo.find(...)` is. Also, I believe it's possible to use `mapply` instead of the loop - the calculation time would be dramatically shorter :) – Sagi Hilleli Aug 13 '15 at 15:18
  • See the updated code. `mapply` will not help, because you have to iterate through the cursor. If you don't understand how it works, please see the MongoDB manual: http://docs.mongodb.org/manual/core/cursors/ – Dmitriy Selivanov Aug 13 '15 at 16:03
  • within the `while` loop shouldn't it be `result_lst[[i]]` ? – tospig Jan 01 '16 at 20:00