Get unique keys and distinct list of values for each

Question

I have a set of documents like this:

Input

[
  { color: "red", size: "small" },
  { color: "blue", size: "small" },
  { color: "red", size: "medium" },
  { color: "green", size: "medium" },
  { color: "black", size: "large" }
];

I want to create a set that is made up of each key, and the distinct values for each key:

Output

[
  { name: "color", values: ["red", "blue", "green", "black"] },
  { name: "size", values: ["small" "medium", "large"] }
]

I won't know what the keys of the input document will be.

I know how to solve the two problems separately:

Get the keys of an arbitrary document by following this answer
Use the aggregation frameworks $addToSet operator to get the distinct list of values for each key.

I would like to do this in one pass. I would think that it's possible to use the aggregation framework for step 1, and pipe that into step 2, but maybe not....

thanks

Are you trying to do this per document somehow or are you trying to get a result across a whole collection? If those listed values at the top are part of a single document, then what is the key? Can we assume `_id`? — Neil Lunn, Mar 21 '14 at 00:51
To answer your question --- you can't do that in one step, without knowing the field names in advance. In fact, as you saw from the 1st answer you linked to, there's not a natural way to gather the field names from a document. Once you've gathered those, the second step, as you said, is simple. — WiredPrairie, Mar 21 '14 at 01:29

score 1 · Answer 1 · answered Mar 21 '14 at 01:35

As I mentioned in a comment, without knowing the field names in advance, you can't do this in a single pass ..., unless you were willing to consider a different schema.

Here's an idea for example with a different schema that collects the same data, but as keys and values:

{ values : [  {  "k" : "color",  "v" : "red" }, 
              {  "k" : "size",  "v" : "small" } ] }
{ values : [  {  "k" : "color",  "v" : "blue" }, 
              {  "k" : "size",  "v" : "small" } ] }
{ values : [  {  "k" : "color",  "v" : "red" },
              {  "k" : "size",  "v" : "medium" } ] }
{ values : [  {  "k" : "color",  "v" : "green" }, 
              {  "k" : "size",  "v" : "medium" } ] }
{ values : [  {  "k" : "color",  "v" : "black" },
              {  "k" : "size",  "v" : "large" } ] }

The aggregation would be trivial, as it will just group on the key name and use $addToSet to collect the values.

> db.test.aggregate({ $unwind : '$values' }, 
     { $group : 
          {  _id : "$values.k", 
            value: { $addToSet: "$values.v" } } })
{
        "result" : [
                {
                        "_id" : "size",
                        "value" : [
                                "large",
                                "medium",
                                "small"
                        ]
                },
                {
                        "_id" : "color",
                        "value" : [
                                "black",
                                "green",
                                "blue",
                                "red"
                        ]
                }
        ],
        "ok" : 1
}

Thanks for confirming in your comment that the awkward first step can't be solved. I'm not too keen on changing the way the documents are stored - I'd prefer them to stay as "natural" to the business as possible. This _is_ a solution that would solve my problem. I just need to weigh "two steps" vs. "readable storage". — M Falanga, Mar 21 '14 at 01:52
@MFalanga For what it is worth, there generally are benefits of re-considering your schema design from original conceptions. And it is something that MongoDB is very forgiving with. In a [longer form](http://stackoverflow.com/q/22301716/2313887) of this the general conclusion is that you can do "fancy" things, or change your schema and simplify. As mostly speaking the "fancy" way is just trying to re-shape the result into what the simple way already does. — Neil Lunn, Mar 21 '14 at 02:18

Neil Lunn · Answer 2 · 2014-03-21T04:14:37.083

I suppose that one way to do it would be entirely in mapReduce:

First a mapper:

var mapper = function () {

  for ( var k in this ) {
    if ( k != '_id' )
      emit( { name: k }, this[k] );
  }

};

And then for the reducer:

var reducer = function ( key, values ) {

    var unique = [];

    Array.prototype.inArray = function(value) {
        for( var i=0; i < this.length; i++) {
            if ( this[i] == value ) return true;
        }
        return false;
    };

    Array.prototype.addToSet = function(value) {
        if ( this.length == 0 ) {
            this.push(value);
        } else if ( !this.inArray(value) ) {
            this.push(value);
        }
    };

    values.forEach(function(value) {
        unique.addToSet(value);
    });

    return { values: unique };

};

And then run the operation for the output:

db.collection.mapReduce(mapper,reducer,{ out: { inline: 1 } })

Which gives "nice" mapReduce style output:

{
    "results" : [
            {
                    "_id" : {
                            "name" : "color"
                    },
                    "value" : {
                            "values" : [
                                    "red",
                                    "blue",
                                    "green",
                                    "black"
                            ]
                    }
            },
            {
                    "_id" : {
                            "name" : "size"
                    },
                    "value" : {
                            "values" : [
                                    "small",
                                    "medium",
                                    "large"
                            ]
                    }
            }
    ],
    "timeMillis" : 2,
    "counts" : {
            "input" : 5,
            "emit" : 10,
            "reduce" : 2,
            "output" : 2
    },
    "ok" : 1,
}

As long as you are fine with generating the keys, then you can build like this:

They way you are listing things would make things more difficult, but what would be wrong with the following:

db.collection.aggregate([
    { "$group": {
        "_id": false,
        "size": { "$addToSet": "$size" },
        "color": { "$addToSet": "$color" }
    }}
])

Which gives the result:

{
    "result" : [
            {
                    "_id" : false,
                    "size" : [
                            "large",
                            "medium",
                            "small"
                    ],
                    "color" : [
                            "black",
                            "green",
                            "blue",
                            "red"
                    ]
            }
    ],
    "ok" : 1
}

So you do have the two distinct sets in one pass.

Doing it how you present it is possible, but just more work:

db.collection.aggregate([
    // Project with the "name" as an array of possible
    { "$project": {
          "size": 1,
          "color": 1,
          "name": { "$cond": [ 1, [ "size", "color" ], 0 ] }
    }},

    // Unwind the "name" values. Create duplicates
    { "$unwind": "$name" },

    // Conditionally assign the fields to "value"
    { "$project": {
        "name": 1,
        "value": {"$cond": [
            { "$eq": [ "$name", "size"] },
            "$size",
            "$color"                
        ]}
    }},

    // Group the results by name
    { "$group": {
        "_id": "$name",
        "values": { "$addToSet": "$value" },
    }},

    // Project the fields you want
    { "$project": {
        "_id": 0,
        "name": "$_id",
        "values": 1
    }}
])

And that gives you your projected results.

There is a "funny" use of $cond in there, that should be able to be replaced with with $literal operator in future versions where the "name" gets assigned. After that assigned array is unwound there is now two of everything, but that doesn't matter because of the $addToSet operation later.

So then the "value" gets conditionally assigned based on what matched. Group the results on name, and you have two documents keyed by name with the respective values.

Enjoy.

Your first block is what I came to, but I don't know what attributes (size and color) exist on the document. My current solution is to build that block dynamically. — M Falanga, Mar 21 '14 at 01:57
@MFalanga Then building dynamically is probably going to be the best if you do not want to change schema, and that may be the best option for your case if that is how you need to work with the data. — Neil Lunn, Mar 21 '14 at 02:03
@MFalanga But if your still fishing to keep the schema then there is a mapReduce **only** way to get the results for arbitrary keys. See the edit. — Neil Lunn, Mar 21 '14 at 04:15

Get unique keys and distinct list of values for each

2 Answers2