
Consider a MongoDB database in which each document has the following structure:

{
    "_id" : ObjectId("numbersandletters"),
    "hello" : 0,
    "this" : "AUTO",
    "is" : "34.324.25.53",
    "an" : "7046934",
    "example" : 0,
    "data" : {
        "google" : "SEARCH",
        "wikipedia" : "Placeholder",
        "twitch" : "2016",
        "twitter" : "More_placeholder",
        "facebook" : "Run out of ideas",
        "stackoverflow" : "is great",
    },
    "schema" : "",
    "that" : "",
    "illustrates" : 0,
    "the_point" : "/somethinghere.html",
    "timestamp" : ISODate("2016-03-05T04:53:20.000Z")
}

The above data structure is an example of a single observation. There are approximately 12 million such observations in the database. The field "this" can take the value "AUTO" or "MANUAL".

I am currently importing some of the data from Mongo into R using the rmongodb library and then transforming the resulting list into a data frame.

The R code is the following:

library(rmongodb)

## connect to the local MongoDB instance
m <- mongo.create(host = "localhost", db = "example")

## pull only the "AUTO" documents, projecting the five fields of interest
rawData <- mongo.find.all(m, "example.request", query = list(this = "AUTO"),
                          fields = list(hello = 1L, is = 1L, an = 1L,
                                        data.facebook = 1L, the_point = 1L))

## flatten the list of documents into a data frame
## (the unlist/matrix step coerces every column to character)
rawData <- data.frame(matrix(unlist(rawData), nrow = length(rawData), byrow = TRUE))

The above code works well for relatively small datasets (say, < 1 million observations), but is slow for 12 million.
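
For reference, one way to avoid the `matrix(unlist(...))` coercion is to bind the documents directly with `data.table::rbindlist`; a minimal sketch, assuming each element of `rawData` is a named list as returned by `mongo.find.all()`:

## sketch: bind the documents directly rather than via matrix(unlist(...));
## assumes each document is a named list and that only data$facebook was
## projected from the nested "data" sub-document
library(data.table)

flat <- lapply(rawData, function(doc) {
  doc[["_id"]] <- NULL            # drop the OID object
  doc$data <- doc$data$facebook   # hoist the nested field
  doc
})
dt <- rbindlist(flat, use.names = TRUE, fill = TRUE)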

Is there a smarter (and thus faster) way to import the data from Mongo and then project the resulting data into an R data frame?

Cheers.

  • Possible duplicate of [speed up large result set processing using rmongodb](http://stackoverflow.com/questions/13965261/speed-up-large-result-set-processing-using-rmongodb) – profesor79 Apr 02 '16 at 20:39

1 Answer


Take a look at the mongolite package. You should get some speed gains for a couple of million results.

library(mongolite)
mongo <- mongo(collection = "request", db = "example", url = "mongodb://localhost")

## query and projection are given as JSON strings
df <- mongo$find(query = '{ "this" : "AUTO" }',
                 fields = '{ "_id" : 0, "hello" : 1, "is" : 1, "an" : 1, "data.facebook" : 1, "the_point" : 1 }')

However, as your result set grows, the process of converting to a data.frame slows down.

For this reason I've been experimenting with speeding up mongolite by removing the recursive calls that try to flatten the JSON structure in the query, and instead relying on data.table's rbindlist on the cursor (avoiding the mongolite simplify step that turns the result into a data.frame). This returns a data.table object.
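
The core of that idea can be sketched with plain mongolite and data.table: page through the cursor via `$iterate()` and rbindlist the batches, bypassing the simplification to a data.frame. A rough sketch; the batch size of 10,000 is an arbitrary choice, and any nested sub-documents come back as list columns:

## rough sketch: page through the cursor and rbindlist the batches
library(data.table)

it <- mongo$iterate(query = '{ "this" : "AUTO" }',
                    fields = '{ "_id" : 0, "hello" : 1, "is" : 1, "an" : 1, "data.facebook" : 1, "the_point" : 1 }')

pages <- list()
while (length(docs <- it$batch(10000)) > 0) {
  pages[[length(pages) + 1L]] <- rbindlist(docs, use.names = TRUE, fill = TRUE)
}
dt <- rbindlist(pages, use.names = TRUE, fill = TRUE)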

The resulting mongolitedt package is still in development, and any query you send must return documents that can be coerced into a data.table through rbindlist. On the package home page I've added some benchmarks showing the speed-up this gives.

## install the package with
library(devtools)
install_github("SymbolixAU/mongolitedt")

library(mongolitedt)
## requires data.table and mongolite

# rm(mongo); gc()   ## reset any existing connection first if needed
mongo <- mongo(collection = "request", db = "example", url = "mongodb://localhost") 
bind_mongolitedt(mongo)   ## bind dt functions to mongolite connection object


dt <- mongo$finddt(query = '{ "this" : "AUTO" }', fields = '{ "_id" : 0, "hello" : 1, "is" : 1, "an" : 1, "data.facebook" : 1, "the_point" : 1 }')
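
Because finddt returns a data.table, you can aggregate the result in place without a further conversion step, e.g.:

dt[, .N, by = the_point]   ## row counts per value of the_point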
  • Fantastic! I had never used `mongolite`, but it seems to be incredibly fast. Two things: (1) STREAMING DATA!! My mongodb is continually increasing and right now I am pulling data daily at midnight and running some analysis on it; streaming the data continually would be an incredible addition; I didn't realise R could do this...! Anywhere I could go to get more information on this? (2) When using `mongo$finddt`, is there any way to coerce it to use the `rbindlist` options `use.names = TRUE, fill = TRUE`? – Futh Apr 24 '16 at 14:50
  • @Futh For (2) - I've updated it to use `rbindlist(..., use.names = TRUE, fill = TRUE)` for all `rbindlist` calls. (I thought I had already done this, but I missed it in one place). – SymbolixAU Apr 24 '16 at 21:38
  • @Futh And for (1) - are you after more information about the [`stream_in`](http://www.rdocumentation.org/packages/jsonlite/functions/stream_in) function from [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf)? – SymbolixAU Apr 24 '16 at 21:40
  • Great! I noticed that the `rbindlist(..., use.names = T, fill = T)` was missing in `try_rbind_page.R`, which was messing up my attempts earlier. Thanks for changing this! I did some reading around the `stream_in` function today; it could be extremely useful... I am working on a program that does 'real-time' analysis of incoming data to Mongo (including the analysis and graphing of the data)... the `stream_in` and `stream_out` functions could be of use; I just wish there was more documentation on them. – Futh Apr 24 '16 at 22:22
  • @Futh Have you looked into a `spark` / `sparkR` approach? I'm not too familiar with it myself, but it could be of use. Are you hosting your project online? I'd be interested to see how you get on with it. – SymbolixAU Apr 24 '16 at 23:00