Optimizing for exact array matching over MongoDB collection

Question

I have a collection where I only ever need to look up documents by whole arrays; I can’t think of any scenario where I’d want to look up documents by just one value of that array. Unfortunately, the multikey feature that is always activated for array values apparently can’t be deactivated.

In the documentation it says “The index will be used to lookup a subset of the values (currently the first one) and then the document will be inspected for the exact match.” I think this greatly decreases the performance in my case. Despite the index some lookups take 70 ms and some several minutes, because, depending on the first element, MongoDB sometimes has to search through a few thousand or several hundred thousand documents. At least that is my theory.

Is there some way to avoid this problem, or should I rather serialize my arrays and store them as strings?

Thanks in advance!

I know that you have marked this question as answered but there are some caveats. One thing you do not clarify is the array ordering is `[1,2,3]` considered the same as `[1,3,2]` for search? If these two arrays are equivalent, then the solution below will fail. If these two are not equivalent, then you may want to test your serialized version as there may be a dramatic speed difference. — Gates VP, Jan 19 '12 at 10:17
@GatesVP Yes, that is true. But that is not an exact array match, in my opinion. In order to match those arrays you would want to do an "$in" query, with the original array value index. — Eve Freeman, Jan 19 '12 at 13:25
@GatesVP Sorry, I meant $all. And it looks like that only measures matching one direction--not really a good way to compare (order insensitive) both ways using indexes. You could use $all and $size to get pretty close, but it wouldn't be checking number of occurrences if there are duplicates within the array. — Eve Freeman, Jan 19 '12 at 15:13
The order is important to me. I’m coming from the realm of Python where `[1, 2, 3] != [1, 3, 2]` (but `set([1, 2, 3]) == set([1, 3, 2])`). — Dawn Drescher, Feb 06 '12 at 12:28

Eve Freeman · Accepted Answer · 2012-01-18T20:39:09.293

1

Perhaps you could use a subdocument like:

{
  array_sub_doc: { arr: [1,2,3,4,5] }
}

So that you can do matches like:

db.coll.ensureIndex({array_sub_doc:1});
db.coll.find({array_sub_doc: {arr:[1,2,3,4,5]}})

update I discovered what caused the failure for large arrays. Index keys > 800 bytes will not be indexed. So, if you have a large subdocument and you put an index on it, if it's greater than 800 bytes, and you try to search for it, you won't find it. If you take the index off and search again for the same subdocument, you'll find it (although it will be a full collection scan).

This is documented here as a limitation, and will be removed in future releases: https://jira.mongodb.org/browse/SERVER-3372

So, this will work in general for small arrays.

Here's some test code in case anyone wants to try it out:

var randomArray = function() {
  var len = 80;
  var randomarr = new Array();
  for (var i=0; i<len; i++) {
    randomarr.push(Math.floor(Math.random() *10000));
  }
  return randomarr;
}

var insert = function() {
  db.Test2.ensureIndex({array_sub_doc:1});
  for(var i=0;i<10000;i++) {
    db.Test2.save({array_sub_doc: {arr: randomArray()}});
  }
}

db.Test2.remove();
insert();

var one = db.Test2.findOne();
db.Test2.findOne({array_sub_doc:one.array_sub_doc});

//...

db.Test2.find({array_sub_doc:one.array_sub_doc}).explain(0);
/* outputs:
{
  "cursor" : "BtreeCursor array_sub_doc_1",
  "nscanned" : 1,
  "nscannedObjects" : 1,
  ...
*/

edited Jan 18 '12 at 20:39

answered Jan 18 '12 at 17:59

Eve Freeman

32,467
4
86
101

I just wrote a more thorough test for this, and it doesn't seem to find a match when I use larger arrays (like 100 or 1000 elements)--works fine for 5 elements, though. Weird; maybe I did something wrong. I'll look at it some more later. – Eve Freeman Jan 18 '12 at 18:25
According to my own quick tests with short arrays, this workaround seems to properly work around my problem. Thanks! If there’s no (even) simpler solution, I’ll accept your answer as a general solution, but in my case the serialization solution seems more intuitive, as my strings can’t contain spaces and a whitespace is also very intuitive as separator. – Dawn Drescher Jan 18 '12 at 18:52
What sorts of values do your "real" arrays have? Just curious. – Eve Freeman Jan 18 '12 at 19:52
They are lists of n-grams, ordered sequences of English words, with 1 <= n <= 5. I use the NLTK Treebank tokenizer, which uses spaces as one of the markers of word boundaries, and it is rather intuitive to uses spaces to separate words. ;-) – Dawn Drescher Jan 18 '12 at 20:06
You should be fine then, with something that short. Updated above after discovering the limit listed in the documentation here: http://www.mongodb.org/display/DOCS/Indexes#Indexes-KeysTooLargeToIndex – Eve Freeman Jan 18 '12 at 20:15

Optimizing for exact array matching over MongoDB collection

1 Answers1