Introduction/Measures
I am working with a MongoDB database holding about 10 GB of data (nearly 3 million records).
Each record (document) has a field called DomainClass, which takes one of 11 different classes that we defined beforehand.
What I'm trying to accomplish
For statistical purposes, I have to extract 100 records of each DomainClass from this database. I can't simply take the first 100, because the sample would be biased. I need those 100 records to be picked at random from across the database.
What I have tried
This is basically what I have tried (in C#):
1 - Count the number of records that belong to a certain DomainClass.
2 - Generate 100 random numbers between 0 and that count.
3 - Find all the records that belong to that DomainClass.
4 - Load them into memory as a list.
5 - Use the 100 previously generated integers as indexes into this list (to satisfy the randomization requirement).
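To make the steps above concrete, here is a rough sketch of the same idea. It is written in Python for brevity rather than the C# I am actually using, and the `records` list is a hypothetical stand-in for the result of the Find query in steps 3 and 4. It also assumes the 100 indexes should be distinct, which is why it uses `random.sample`.

```python
import random


def sample_by_index(records, sample_size=100, seed=None):
    """Pick up to sample_size records at distinct random positions of a list."""
    rng = random.Random(seed)
    count = len(records)                    # step 1: count the records of the class
    k = min(sample_size, count)             # never ask for more than exist
    indices = rng.sample(range(count), k)   # step 2: distinct random indexes in [0, count)
    return [records[i] for i in indices]    # step 5: use the indexes to pick from the list


# Hypothetical stand-in for steps 3-4: every record of one class, held in memory.
records = [{"_id": i, "DomainClass": "A"} for i in range(1000)]
sample = sample_by_index(records, seed=42)
```

The list comprehension is where the memory concern comes from: the whole class has to be in `records` before any index can be used.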
Flaws
I'm afraid that I won't be able to allocate enough memory (RAM) to hold all the records of a single class. Since I need the records to come from random positions in the database, I have to load them all into memory in order to generate a fully randomized sample.
Considerations
I have no random field in the documents. My best bet is the date field of the document, which looks like this:
"CreationDate" : ISODate("2013-06-25T22:43:15.571Z")
I could get pseudo-random records by finding the records that were created in a certain second, for example, but I could not find any way to do it, since the seconds are not a field of their own.
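To show what I mean, this is roughly the kind of filter I have in mind: a half-open range covering one second of CreationDate, which might avoid the need for a separate seconds field. It is an untested sketch in Python. The function name and the "SomeClass" value are hypothetical, while `$gte` and `$lt` are the standard MongoDB comparison operators.

```python
from datetime import datetime, timedelta, timezone


def one_second_filter(domain_class, moment):
    """Build a MongoDB-style filter for documents created within one second."""
    start = moment.replace(microsecond=0)   # truncate to the start of the second
    end = start + timedelta(seconds=1)      # exclusive upper bound
    return {
        "DomainClass": domain_class,
        "CreationDate": {"$gte": start, "$lt": end},
    }


# Uses the CreationDate from the example document above.
moment = datetime(2013, 6, 25, 22, 43, 15, 571000, tzinfo=timezone.utc)
query = one_second_filter("SomeClass", moment)
```

Even if this works, the records sharing a second are only pseudo-random, so it would not replace a proper random sample.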
Thanks in advance. Let me know if there's any other information I should provide.