13

In my mongoDB book collection I have documents structured as follow :

/* 0 */
{
  "_id" : ObjectId("50485b89b30f1ea69110ff4c"),

  "publisher" : {
    "$ref" : "boohya",
    "$id" : "foo"
  },
  "displayName" : "Paris Nightlife",
  "catalogDescription" : "Some desc goes here",
  "languageCode" : "en",
  "rating" : 0,
  "status" : "LIVE",
  "thumbnailId" : ObjectId("50485b89b30f1ea69110ff4b"),
  "indexTokens" : ["Nightlife", "Paris"]
}

I perform the following regex query to find all documents having one indexToken starting with "Par" :

{ "indexTokens" : { "$regex" : "^Par" , "$options" : "i"}}

If I select only the indexTokens field to be returned like this :

{ "indexTokens" : 1}

The resulting DBObject is

{ "_id" : { "$oid" : "50485b89b30f1ea69110ff4c"} , "indexTokens" : [ "Nightlife" , "Paris"]}

What I would like to get is ONLY the token / tag that matched the regex (I don0t care about retrieving the document at this point, neither do I need all the tags of the matched document)

Is this a case for the new Aggregation Framework relesed under MongoDB v2.2. ?

If yes how do I modify my query so that the actual result would look like :

{ "indexTokens" : ["Paris", "Paradise River", "Parma" , etc ....]}

Bonus question (do you has teh codez) : How do I do it using the Java driver ?

For now my java looks like :

DBObject query = new BasicDBObject("indexTokens", java.util.regex.Pattern.compile("^"+filter+"", Pattern.CASE_INSENSITIVE));
    BasicDBObject fields = new BasicDBObject("indexTokens",1);
    DBCursor curs = getCollection()
                    .find(query, fields)
                    .sort( new BasicDBObject( "indexTokens" , 1 ))
                    .limit(maxSuggestionCount);

Thx :)

EDIT:

As per your answers I modified my JAVA code as follow :

BasicDBObject cmdBody = new BasicDBObject("aggregate", "Book"); 
    ArrayList<BasicDBObject> pipeline = new ArrayList<BasicDBObject>(); 

    BasicDBObject match = new BasicDBObject("$match", new BasicDBObject("indexTokens", java.util.regex.Pattern.compile("^"+titleFilter+"", Pattern.CASE_INSENSITIVE)));
    BasicDBObject unwind = new BasicDBObject("$unwind", "$indexTokens");
    BasicDBObject match2 = new BasicDBObject("$match", new BasicDBObject("indexTokens", java.util.regex.Pattern.compile("^"+titleFilter+"", Pattern.CASE_INSENSITIVE)));
    BasicDBObject groupFilters = new BasicDBObject("_id",null);
    groupFilters.append("indexTokens", new BasicDBObject( "$push", "$indexTokens"));
    BasicDBObject group = new BasicDBObject("$group", groupFilters);

    pipeline.add(match);
    pipeline.add(unwind);
    pipeline.add(match2);
    pipeline.add(group);

    cmdBody.put("pipeline", pipeline); 



    CommandResult res = getCollection().getDB().command(cmdBody);
    System.out.println(res);

Which outputs

{ "result" : [ { "_id" :  null  , "indexTokens" : [ "Paris"]}] , "ok" : 1.0}

This is genius !

Thanks a lot !

Neil Lunn
  • 148,042
  • 36
  • 346
  • 317
azpublic
  • 1,404
  • 4
  • 20
  • 42

2 Answers2

13

You could do this with the 2.2 aggregation framework. Something like this;

db.books.runCommand("aggregate", {
    pipeline: [
        {   // find docs that contain Par*
            $match: { "indexTokens" : { "$regex" : "^Par" , "$options" : "i"}},
        },
        {   // create a doc with a single array elemm for each indexToken entry
            $unwind: "$indexTokens" 
        },
        {   // now produce a list of index tokens
            $group: {
                _id: "$indexTokens",
            },
        },
    ],
})

Or this might be even closer to what you're after if you really want the array without the doc;

db.books.runCommand("aggregate", {
    pipeline: [
        {   // find docs that contain Par*
            $match: { "indexTokens" : { "$regex" : "^Par" , "$options" : "i"}},
        },
        {   // create a doc with a single array elemm for each indexToken entry
            $unwind: "$indexTokens" 
        },
        {   // now throw out any unwind's that DON'T contain Par*
            $match: { "indexTokens": { "$regex": "^Par", "$options": "i" } },
        },
        {   // now produce the list of index tokens
            $group: {
                _id: null,
                indexTokens: { $push: "$indexTokens" },
            },
        },
    ],
})
cirrus
  • 5,624
  • 8
  • 44
  • 62
  • You can add this to your original answer as a second solution. This way people won't be confused why there are two answers from you :) – Sammaye Sep 06 '12 at 11:00
  • Thanks to both of you it works like a charm. I added an answer to show how I did it in JAVA (I don't have the latest driver so I couldnt use the aggregate() method on DBCollection. – azpublic Sep 06 '12 at 15:40
5

Building on the response from cirrus, I recommend doing the $unwind first to avoid the redundant $match. Something like:

db.books.aggregate(
    {$unwind:"$indexTokens"},
    {$match:{indexTokens:/^Par/}},
    {$group:{_id:null,indexTokens:{$push:"$indexTokens"}}
})

How do you do this in Java? You can use the DBCollection.aggregate(...) method of the MongoDB v2.9.0 driver. Each pipeline operator, eg. $unwind or $match, corresponds to a DBObject object.

slee
  • 524
  • 2
  • 6
  • 1
    Actually, I don't think that $match is redundant. The trouble with $unwind is that it'll have to create a massive set of docs in RAM and you want to reduce that set down as early as possible. The first $match ensures that we're ONLY working with docs that even have Par* in the indexTokens before we unwind them. The second $match then cuts that set down to just the ones we want. Remember, you want to get your $match in early to reduce the pipeline volume. – cirrus Sep 06 '12 at 14:24
  • 1
    You're right. Weed out the non-matching documents, unwind the arrays, and then match again to weed out the documents that don't match the regular expression. – slee Sep 06 '12 at 14:33