I have a MongoDB collection. Put simply, it has two fields, user and url, and 39,274,590 documents. The key of the collection is {user, url}.

Using Java, I try to list the distinct URLs:

  MongoDBManager db = new MongoDBManager( "Website", "UserLog" );
  return db.getDistinct("url"); 

But I receive an exception:

Exception in thread "main" com.mongodb.CommandResult$CommandFailure: command failed [distinct]: 
{ "serverUsed" : "localhost/127.0.0.1:27017" , "errmsg" : "exception: distinct too big, 16mb cap" , "code" : 10044 , "ok" : 0.0}

How can I solve this problem? Is there a plan B that avoids it?

Mark Rotteveel
Munichong

4 Answers

In version 2.6 and later you can use the aggregation framework with the $out stage to write the results to a separate collection: http://docs.mongodb.org/manual/reference/operator/aggregation/out/

This gets around the 16 MB cap, which comes from MongoDB's maximum document size: distinct returns every value in one result document, whereas $out writes one document per value. You can read more about using the aggregation framework on large datasets in MongoDB 2.6 here: http://vladmihalcea.com/mongodb-2-6-is-out/

To do a 'distinct' query with the aggregation framework, group by the field:

db.userlog.aggregate([{$group: {_id: '$url'} }]); 

Note: I don't know how this works for the Java driver, good luck.
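
On the Java side, the pipeline is just an ordered list of stage documents. Here is a minimal sketch, not driver code: it uses plain java.util maps so it runs without the MongoDB driver, and it builds the $group stage above followed by a $out stage. The class name, the method name, and the target collection name distinct_urls are made up for illustration. With the real driver you would build the equivalent stage documents and pass the list to the collection's aggregate method.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DistinctPipelineSketch {

    // Builds [{$group: {_id: "$<field>"}}, {$out: "<targetCollection>"}] as plain maps.
    // These maps only mirror the shape of the stage documents; they are not driver types.
    static List<Map<String, Object>> distinctPipeline(String field, String targetCollection) {
        Map<String, Object> groupSpec = new LinkedHashMap<>();
        groupSpec.put("_id", "$" + field);   // group key: one output document per distinct value

        Map<String, Object> group = new LinkedHashMap<>();
        group.put("$group", groupSpec);

        Map<String, Object> out = new LinkedHashMap<>();
        out.put("$out", targetCollection);   // hypothetical collection that receives the results

        return Arrays.asList(group, out);
    }

    public static void main(String[] args) {
        // Print the pipeline shape for the url field from the question.
        System.out.println(distinctPipeline("url", "distinct_urls"));
    }
}
```

Because $out writes one document per distinct value, the results can then be read back from the target collection with an ordinary cursor.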

Vlad Mihalcea
Will Shaver

Take a look at this answer

1) The easiest way to do this is via the aggregation framework. This takes two "$group" stages: the first groups by the distinct values, and the second counts the resulting groups.

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
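
The two phases can be modelled in plain Java on an in-memory list, which may make the idea easier to see. This is only an illustration of the logic, not a MongoDB call: the list stands in for the collection, and the class and method names are invented for this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TwoPhaseDistinctCount {

    // Phase 1: collapse the input to one entry per distinct value
    // (the role of the first $group, or of the map/reduce output collection).
    static Set<String> distinctValues(List<String> urls) {
        return new LinkedHashSet<>(urls);
    }

    // Phase 2: count the entries produced by phase 1
    // (the role of the second $group, or of count() on the new collection).
    static long countDistinct(List<String> urls) {
        return distinctValues(urls).size();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("/a", "/b", "/a", "/c", "/b");
        System.out.println(distinctValues(urls));
        System.out.println(countDistinct(urls));
    }
}
```

On the server, the output of phase 1 is the part that can exceed 16 MB, which is why it is saved to a collection rather than returned inline.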

Community
gmaniac

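If you are using version 3.0 or later of the MongoDB Java driver, you can use the DistinctIterable class with batchSize.

import com.mongodb.client.DistinctIterable;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// mongodb is an existing MongoDatabase instance
MongoCollection<Document> coll = mongodb.getCollection("mycollection");
DistinctIterable<String> ids = coll.distinct("id", String.class).batchSize(100);
for (String id : ids) {
    System.out.println(id);
}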

http://api.mongodb.com/java/current/com/mongodb/client/DistinctIterable.html

Ankit Marothi

Version 3.x of the Java driver, used from Groovy:

import com.mongodb.client.AggregateIterable
import com.mongodb.client.MongoCollection
import com.mongodb.client.MongoCursor
import com.mongodb.client.MongoDatabase
import static com.mongodb.client.model.Accumulators.sum
import static com.mongodb.client.model.Aggregates.group
import static java.util.Arrays.asList
import org.bson.Document

//other code

AggregateIterable<Document> iterable = collection.aggregate(
    asList(
        group('$url', sum("count", 1))
    )
).allowDiskUse(true)

MongoCursor<Document> cursor = iterable.iterator()

while(cursor.hasNext()) {
    Document doc = cursor.next()
    println(doc.toJson())
}
Tommy Ng