I'm trying to find a way to get the unique values of several arrays in the same document and across documents. It's best explained with an example:

[
    {
      _id: "x",
      products: {
        product_a: ["v1", "v2"],
        product_b: ["v3", "v2"]
      }
    },
    {
      _id: "y",
      products: {
        product_a: ["v1"],
        product_b: ["v3", "v4"]
      }
    }
]

What I'm trying to get is:

  1. The number of unique values for each document. There are 3 unique values for products in 'x' and 3 unique values in 'y'.
  2. The number of unique values overall. There are 4 unique values for all documents in the collection.
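For concreteness, here is how those two numbers fall out of the example in plain JavaScript (a sketch; the documents are hard-coded rather than read from MongoDB):

```javascript
var docs = [
    { _id: "x", products: { product_a: ["v1", "v2"], product_b: ["v3", "v2"] } },
    { _id: "y", products: { product_a: ["v1"], product_b: ["v3", "v4"] } }
];

// Distinct values of one document's products object.
function uniqueValues(products) {
    var seen = [];
    for (var name in products) {
        var values = products[name];
        for (var i = 0; i < values.length; i++) {
            if (seen.indexOf(values[i]) === -1) {
                seen.push(values[i]);
            }
        }
    }
    return seen;
}

var perDoc = {};   // per-document counts: { x: 3, y: 3 }
var overall = [];  // distinct values across all documents
docs.forEach(function (doc) {
    var unique = uniqueValues(doc.products);
    perDoc[doc._id] = unique.length;
    unique.forEach(function (v) {
        if (overall.indexOf(v) === -1) {
            overall.push(v);
        }
    });
});
// overall.length is 4
```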
refaelos

1 Answer

When you are unable or unwilling to change the schema, you can do both with MapReduce.

Unique values per document

Your map-function would collect all the values in products into one array, remove duplicates, and then emit the size of that array with the _id as key. Details about how to remove duplicates can be found in this question (ignore the answers which use libraries for web-browser JavaScript).

function mapFunction() {
    var ret = [];
    for (var name in this.products) {
        // "name" is just the key; the array is this.products[name]
        var values = this.products[name];
        for (var i = 0; i < values.length; i++) {
            if (ret.indexOf(values[i]) === -1) { // skip duplicates
                ret.push(values[i]);             // (see question 9229645 for alternatives)
            }
        }
    }
    emit(this._id, ret.length);
}

Your keys are unique, so your reduce-function will never be called with more than one value per key. That means it can just return the first element of the values-array.

function reduceFunction(key, values) {
    return values[0];
}
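You can sanity-check this map/reduce pair without a mongod by simulating the run in plain JavaScript. This is only a sketch: `emit` and the hard-coded sample documents stand in for what the server provides.

```javascript
// Stand-in for the emit() the MongoDB server provides to map functions:
// it groups emitted values by key.
var emitted = {};
function emit(key, value) {
    (emitted[key] = emitted[key] || []).push(value);
}

function mapFunction() {
    var ret = [];
    for (var name in this.products) {
        var values = this.products[name];
        for (var i = 0; i < values.length; i++) {
            if (ret.indexOf(values[i]) === -1) { // skip duplicates
                ret.push(values[i]);
            }
        }
    }
    emit(this._id, ret.length);
}

function reduceFunction(key, values) {
    return values[0];
}

// The sample documents from the question.
var docs = [
    { _id: "x", products: { product_a: ["v1", "v2"], product_b: ["v3", "v2"] } },
    { _id: "y", products: { product_a: ["v1"], product_b: ["v3", "v4"] } }
];

docs.forEach(function (doc) { mapFunction.call(doc); });

// MongoDB only calls reduce for keys with more than one value; calling it
// unconditionally here makes no difference because it just returns values[0].
var results = {};
for (var key in emitted) {
    results[key] = reduceFunction(key, emitted[key]);
}
// results is { x: 3, y: 3 }
```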

Unique values overall

You can do this by emitting each value as a key but with a meaningless value.

Your map-function would iterate the products-object and then iterate each array, emitting every value:

function mapFunction() {
    for (var name in this.products) {
        var values = this.products[name];
        for (var i = 0; i < values.length; i++) {
            emit(values[i], null);
        }
    }
}

Because the values are meaningless, your reduce-function doesn't do anything with them:

function reduceFunction(key, values) {
    return null;
}

The result will be a set of documents where each _id is one of the unique values in your data.
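The same kind of stand-alone simulation shows the shape of that output (a sketch; the stand-in `emit` collapses duplicate keys the way the server's group-by-key step plus the no-op reduce does):

```javascript
// Stand-in emit: collect each distinct key once.
var keys = [];
function emit(key, value) {
    if (keys.indexOf(key) === -1) {
        keys.push(key);
    }
}

function mapFunction() {
    for (var name in this.products) {
        var values = this.products[name];
        for (var i = 0; i < values.length; i++) {
            emit(values[i], null);
        }
    }
}

// The sample documents from the question.
var docs = [
    { _id: "x", products: { product_a: ["v1", "v2"], product_b: ["v3", "v2"] } },
    { _id: "y", products: { product_a: ["v1"], product_b: ["v3", "v4"] } }
];

docs.forEach(function (doc) { mapFunction.call(doc); });
// keys is ["v1", "v2", "v3", "v4"], so keys.length gives the overall count of 4
```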

When you can change the schema

When there is no good reason to keep your schema the way it currently is, you could make your life much easier by turning the products object into an array:

  products: [
    { product: "product_a", values: ["v1", "v2"] },
    { product: "product_b", values: ["v3", "v2"] }
  ]

In that case you could use the aggregation pipeline.

  1. use $unwind to turn the values-arrays into one document per value
  2. use $group with $addToSet to re-merge the documents while discarding duplicates
  3. use $unwind again to get a stream of single-value documents, but this time without duplicates
  4. use $group with $sum: 1 to count the unique values.
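The four steps above could look something like this in the shell (a sketch; `db.coll` is a placeholder for your collection, and the array schema requires two $unwinds in step 1):

```javascript
db.coll.aggregate([
    // 1. one document per single value: unwind the products array,
    //    then each values array
    { $unwind: "$products" },
    { $unwind: "$products.values" },
    // 2. re-merge into one document; $addToSet discards duplicates
    { $group: { _id: null, unique: { $addToSet: "$products.values" } } },
    // 3. one document per value, this time without duplicates
    { $unwind: "$unique" },
    // 4. count them; for the sample data this yields a count of 4
    { $group: { _id: null, count: { $sum: 1 } } }
])
```

For the per-document counts, group by `"$_id"` instead of `null` in step 2 (and again in the final $group).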
Philipp
  • Thanks! MapReduce performance is really bad on mongo. I might go with your second option. – refaelos Dec 23 '13 at 05:54
  • Why change schema? You can do this with aggregation framework if you know the field/product names. – Asya Kamsky Dec 23 '13 at 06:53
  • @AsyaKamsky I would assume that refaelos has more than just two products. When you would use the aggregation-framework, you would need to $project every single product-name. When there are more than a few hundred products, that query might become a bit unwieldy. – Philipp Dec 23 '13 at 07:05
  • Several doesn't sound like hundreds, agg framework is an order of magnitude faster than MR, anything programmatic isn't unwieldy and in 2.6 AF will be able to handle this trivially via sets. – Asya Kamsky Dec 23 '13 at 07:10
  • @AsyaKamsky I do know the names of the products. Can you tell me how to do this without having to change the schema? – refaelos Dec 23 '13 at 09:34
  • How many products do you expect? And is this code already in production or will be soon? – Asya Kamsky Dec 23 '13 at 15:01