Introduction/Measures

I am working with a MongoDB database with 10 GB of records (nearly 3 million records).

Each record (document) has a field called DomainClass (which is one out of 11 different classes, previously defined by us).

What I'm trying to accomplish

For statistical purposes, I have to extract 100 records of each DomainClass from this database. I can't simply take the first 100, because the sample would be biased; the 100 records need to be chosen at random from across the database.

What I have tried

This is basically what I have tried (in C#):

1 - Count the number of records that belong to a certain DomainClass.

2 - Generate 100 random numbers between 0 and the count.

3 - Find all the records that belong to that DomainClass.

4 - Put them in memory, as a list.

5 - Use the 100 previously generated random integers as indexes into this list (to satisfy the randomization requirement).
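
One detail of step 2 is worth sketching: if the 100 numbers are drawn independently, some may repeat, which would yield fewer than 100 distinct records. The snippet below is a minimal sketch in JavaScript (the language of the shell snippets in the answer), not the original C# code. `sampleDistinctIndices` is a hypothetical helper name, and the count of 250000 is an illustrative value only.

```javascript
// Hypothetical helper: returns k distinct random integers in [0, count).
function sampleDistinctIndices(count, k) {
  if (k > count) {
    throw new RangeError("cannot draw " + k + " distinct indices from " + count);
  }
  const picked = new Set();
  // A Set silently ignores repeats, so keep drawing until k distinct values exist.
  while (picked.size < k) {
    picked.add(Math.floor(Math.random() * count));
  }
  return Array.from(picked);
}

// Illustrative call: 100 distinct positions within a class of 250000 documents.
const indices = sampleDistinctIndices(250000, 100);
```

Redrawing on collision is cheap when k is much smaller than count, as it is here; it becomes slow only when k approaches count.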

Flaws

I'm afraid I won't be able to allocate enough memory (RAM) for all the records of a single class. Since I need records from random positions in the database, I have to load them all into memory before I can draw a fully randomized sample.

Considerations

I have no random field in the documents. My best bet is the date field of each document, which looks like this:

"CreationDate" : ISODate("2013-06-25T22:43:15.571Z")

I could get pseudo-random records by finding the records that were created in a certain second, for example, but I could not find any way to do it, since the seconds are not a field themselves.
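
For reference, the seconds component can be extracted on the server with the `$second` aggregation operator. The following is only a sketch under assumptions: it uses `$expr` inside `$match`, which requires MongoDB 3.6 or later, and `buildSecondPipeline` is a hypothetical helper name. It builds the pipeline as plain objects; in the shell it would be passed to `db.collection.aggregate(...)`.

```javascript
// Hypothetical helper: builds an aggregation pipeline that keeps documents of one
// DomainClass whose CreationDate falls on a given second (0-59) of any minute.
// Assumes MongoDB 3.6+ for $expr; field names are taken from the question.
function buildSecondPipeline(domainClass, second, limit) {
  return [
    {
      $match: {
        DomainClass: domainClass,
        // $second extracts the seconds part of a date value.
        $expr: { $eq: [{ $second: "$CreationDate" }, second] }
      }
    },
    { $limit: limit }
  ];
}

// The example date 2013-06-25T22:43:15.571Z has a seconds value of 15.
const pipeline = buildSecondPipeline("aClass", 15, 100);
// In the mongo shell: db.collection.aggregate(pipeline);
```

Note that this selects by creation time rather than uniformly at random, so it only gives a pseudo-random sample if creation seconds are unrelated to the properties being studied.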

Thanks in advance. Let me know if there's any other information I must provide.

Marcello Grechi Lins
  • Why should you put the whole database in memory? Just find the random numbers and query the database to get the specific document. – chaliasos Jul 16 '13 at 16:45
  • Not the whole database, but all the documents of a certain DomainClass in my case. How will I query for a specific document if I need random documents of each class? I don't think you understood my issue – Marcello Grechi Lins Jul 16 '13 at 16:48
  • Hmm, that answer below is not the best way to get a random record; in fact, it is a really slow way. There are many links on Google for doing this – Sammaye Jul 16 '13 at 18:56

1 Answer

My approach would be:

  1. Get all the random numbers that will point to a document (not an element in a list)
  2. Run the following query for each random number:

    db.collection.find().skip(random).limit(1);

Edit

For each DomainClass:

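    // Number of documents in this class
    var count = db.collection.find({DomainClass: "aClass"}).count();
    // Random offset in [0, count)
    var random = Math.floor(Math.random() * count);
    // Skip to that offset and fetch a single document
    var randomDoc = db.collection.find({DomainClass: "aClass"}).skip(random).limit(1);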

Put this in a loop and I think it will solve your problem.

My point is to use skip and limit to get each random document directly from the database. Since no sorting takes place, the documents come back in the same order as in your list, so skip and limit give you the same result as DomainClassList.ElementAt(index) on the client side.
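
A rough sketch of that loop, assuming `coll` behaves like a mongo shell collection whose `find(filter)` returns a cursor with `count()`, `skip()`, `limit()`, and `toArray()`. `sampleByClass` is a hypothetical name, and distinct offsets are used so that no document is returned twice.

```javascript
// Hypothetical helper: for each class, fetch up to `perClass` documents
// at distinct random offsets using skip/limit, one query per document.
function sampleByClass(coll, domainClasses, perClass) {
  const result = {};
  for (const cls of domainClasses) {
    const filter = { DomainClass: cls };
    const count = coll.find(filter).count();
    // A class may hold fewer documents than requested.
    const wanted = Math.min(perClass, count);

    // Collect distinct offsets in [0, count).
    const offsets = new Set();
    while (offsets.size < wanted) {
      offsets.add(Math.floor(Math.random() * count));
    }

    result[cls] = [];
    for (const offset of offsets) {
      const docs = coll.find(filter).skip(offset).limit(1).toArray();
      if (docs.length > 0) {
        result[cls].push(docs[0]);
      }
    }
  }
  return result;
}

// In the mongo shell (class names here are placeholders):
// var sample = sampleByClass(db.collection, ["aClass", "anotherClass"], 100);
```

Only one document per query is held in memory, which avoids loading a whole class. The trade-off is that the server still walks past every skipped document, so large offsets are slow.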

Leandro Bardelli
chaliasos
  • This will lead to a simple random record from the database, not to a random record of a certain DomainClass. Let's say my DomainClasses are A, B, C and D. I need 100 documents that belong to class A, 100 to class B and so on... – Marcello Grechi Lins Jul 16 '13 at 16:50
  • @MarcelloGrechiLins check again. I think it is simpler if you do it that way. – chaliasos Jul 16 '13 at 16:59
  • The original question was in C#. Skip expects an integer, and the document count comes as a long, so the random documents will only come from the first Int32 Max portion of your collection – Dorival Jun 08 '16 at 17:27