Consider a MongoDB database in which each document has the following structure.
{
    "_id" : ObjectId("numbersandletters"),
    "hello" : 0,
    "this" : "AUTO",
    "is" : "34.324.25.53",
    "an" : "7046934",
    "example" : 0,
    "data" : {
        "google" : "SEARCH",
        "wikipedia" : "Placeholder",
        "twitch" : "2016",
        "twitter" : "More_placeholder",
        "facebook" : "Run out of ideas",
        "stackoverflow" : "is great"
    },
    "schema" : "",
    "that" : "",
    "illustrates" : 0,
    "the_point" : "/somethinghere.html",
    "timestamp" : ISODate("2016-03-05T04:53:20.000Z")
}
The document above is a single observation; there are approximately 12 million such observations in the database. The field "this" can take the value "AUTO" or "MANUAL".
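(For scale, the size of the "AUTO" subset can be checked cheaply with mongo.count before attempting a full import; this sketch assumes the same local server and namespace as the import code below.)

library(rmongodb)
# Count documents matching the filter without pulling any data over
m <- mongo.create(host = "localhost", db = "example")
mongo.count(m, "example.request", query = list(this = "AUTO"))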
I am currently importing some of the data from Mongo into R with the rmongodb package and then transforming the resulting list into a data frame.
The R code is as follows:
library(rmongodb)
# Connect to the local server and pull the "AUTO" documents, projecting
# only the fields of interest (including the nested "data.facebook")
m <- mongo.create(host = "localhost", db = "example")
rawData <- mongo.find.all(m, "example.request", query = list(this = "AUTO"),
                          fields = list(hello = 1L, is = 1L, an = 1L,
                                        data.facebook = 1L, the_point = 1L))
# Flatten the list of documents into a data frame, one row per document
rawData <- data.frame(matrix(unlist(rawData), nrow = length(rawData), byrow = TRUE))
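Note that the matrix(unlist(...)) step coerces every column to character, because a matrix can hold only one type. A toy illustration (the sample values here are made up, not from the real collection):

# Two fake "documents" standing in for the imported list
x <- list(list(hello = 0, an = "7046934"),
          list(hello = 1, an = "1234567"))
df <- data.frame(matrix(unlist(x), nrow = length(x), byrow = TRUE),
                 stringsAsFactors = FALSE)
str(df)  # both columns come back as character, so numeric fields like
         # "hello" need e.g. df$X1 <- as.numeric(df$X1) afterwards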
This approach works well for relatively small result sets (say, under 1 million observations), but it is slow for 12 million.
Is there a smarter (and thus faster) way to import the data from Mongo and then project the resulting data into an R data frame?
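For context, one direction I am considering is the mongolite package, which returns a data frame directly instead of going through an unlist step. A minimal sketch, assuming the same local server and collection (I have not benchmarked this at full scale):

library(mongolite)
# Assumes the same local server, database, and collection as above;
# query and projection are passed as JSON strings
con <- mongo(collection = "request", db = "example", url = "mongodb://localhost")
autoData <- con$find(
    query  = '{"this": "AUTO"}',
    fields = '{"hello": 1, "is": 1, "an": 1, "data.facebook": 1, "the_point": 1}'
)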
Cheers.