10

How can I make query with sorting by an array of string which will be execute without "stage" : "SORT" in its plan?

I'm using mongo 3.6
"mycoll" collection contains about 500.000 documents like these:

{
    someobject:{
        arrayfield:["asd","qwe"]
    }
}

{
    someobject:{
        arrayfield:["zxc"]
    }
}

this query

db.mycoll.find().sort({ "someobject.arrayfield": 1 }).skip(125340).limit(20)

produces an error

Sort operation used more than the maximum 33554432 bytes of RAM

I have and index on "someobject.arrayfield",but explain() gives me:

 "winningPlan" : {
            "stage" : "SKIP",
            "skipAmount" : 125340,
            "inputStage" : {
                    "stage" : "SORT",
                    "sortPattern" : {
                            "someobject.arrayfield" : 1
                    },
                    "limitAmount" : 125360,
                    "inputStage" : {
                            "stage" : "SORT_KEY_GENERATOR",
                            "inputStage" : {
                                    "stage" : "FETCH",
                                    "inputStage" : {
                                            "stage" : "IXSCAN",
                                            "keyPattern" : {
                                                    "someobject.arrayfield" : 1
                                            },
                                            "indexName" : "arrayfield_indexname",

                                            "isMultiKey" : true,
                                            "multiKeyPaths" : {
                                                    "someobject.arrayfield" : [
                                                            "someobject.arrayfield"
                                                    ]
                                            },
                                            "isUnique" : false,
                                            "isSparse" : false,
                                            "isPartial" : false,
                                            "indexVersion" : 2,
                                            "direction" : "forward",
                                            "indexBounds" : {
                                                    "someobject.arrayfield" : [
                                                            "[MinKey, MaxKey]"
                                                    ]
                                            }
                                    }
                            }
                    }
            }
    }

I know, I can increase the limits, use aggregation with 'allowdiskusage' or query

db.mycoll.find().sort({ "someobject.arrayfield.1": 1 }).skip(125340).limit(20)

with index on "someobject.arrayfield.1"

Vlad Mamaev
  • 565
  • 3
  • 15
  • Why do you need to skip `125340` documents? – styvane Oct 20 '18 at 11:29
  • @styvane to get 6267th page with size 20 – Vlad Mamaev Oct 22 '18 at 10:44
  • Is it not possible to apply a filter to the documents instead of skipping them? – styvane Oct 22 '18 at 11:32
  • @styvane Lets say 'no'. My goal is to get a page by its size and number from sorted result. I ommit a filter clause in the question, because I think if it's possible do this with empty find(), it will be also possible do this for filtering query with compound index. – Vlad Mamaev Oct 22 '18 at 13:24
  • Can you explain how you want to sort your documents? By number of items in 'someobject.arrayfield' array? Or something else? Can you show desired sort order in your example? I will gladly try to help you. Thnx – Aleksandar Cvetojevic Oct 22 '18 at 16:54
  • @AleksandarCvetojevic I need the default behavior of sort by array field, I want to sort by minimal element in array ascending. – Vlad Mamaev Oct 23 '18 at 06:50
  • mongo db has a known problem with indexing of arrays in subsocuments. can you put this array on the the main doc? – Amit Wagner Oct 24 '18 at 12:33
  • indexing a la `db.mycoll.createIndex({'someobject.arrayfield':1});` made this work for me without a SORT in the query plan. Are you saying that adding such an index is not an option for you? – Joshua Huber Oct 26 '18 at 15:25
  • @JoshuaHuber "I have and index on "someobject.arrayfield"" and as you can see form my explain() outpu ,it's used in execution but SORT stage is still there, did you use mongo 3.6 ? – Vlad Mamaev Oct 26 '18 at 16:01
  • 1
    @VladMamaev, you're right. This was ok in 3.4, but produces a SORT in 3.6 & 4.0. Unfortunately this is considered a "works as designed" per this mongo jira issue https://jira.mongodb.org/browse/SERVER-33387. They have even added this limitation in the documentation, see notes in https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/ and https://docs.mongodb.com/manual/core/index-multikey/#limitations. An improvement request is filed: https://jira.mongodb.org/browse/SERVER-31898 – Joshua Huber Oct 26 '18 at 18:05
  • Maybe try this https://stackoverflow.com/questions/27023622/overflow-sort-stage-buffered-data-usage-exceeds-internal-limit – smitty_werbenjagermanjensen Jan 10 '19 at 13:00

1 Answers1

2

I have a potential solution, depending on what the values in your array actually are and if you just need a stable sort, or if you require a sort based on the array-comparison logic that mongodb uses.

Skip down to the proposed solution section if you don't want to read some details about how mongodb compares arrays.


At first, I was curious exactly how a .sort() on an array field would order the results. Would it use the first array value to do the comparison? Or some combination of the values?

After some testing, it looks like mongodb uses all the values in the array to compare and order them. This was my test data (_id field omitted for brevity):

db.mycoll.find().sort({"someobject.arrayfield":1})
{ "someobject" : { "arrayfield" : [ "rty", "aaa" ] } }
{ "someobject" : { "arrayfield" : [ "xcv", "aaa", "bcd" ] } }
{ "someobject" : { "arrayfield" : [ "aaa", "xcv", "bcd" ] } }
{ "someobject" : { "arrayfield" : [ "asd", "qwe" ] } }
{ "someobject" : { "arrayfield" : [ "bnm" ] } }
{ "someobject" : { "arrayfield" : [ "dfg", "sdf" ] } }
{ "someobject" : { "arrayfield" : [ "qwe" ] } }

As you can see, it's not sorting based on the first value of the array, but instead comparing the entire array using some internal logic. How does it determine that [ "rty", "aaa" ] should come before [ "xcv", "aaa", "bcd" ] exactly? And why does [ "xcv", "aaa", "bcd" ] come before [ "aaa", "xcv", "bcd" ]? Or are they equal and it's using the _id as a tie breaker? I really don't know.

I thought maybe it was using the standard javascript comparison operators, but that doesn't appear to be the case either. I made an array of each of those arrays and called .sort() on it in and got this:

x.sort()
[ [ 'aaa', 'xcv', 'bcd' ],
  [ 'asd', 'qwe' ],
  [ 'bnm' ],
  [ 'dfg', 'sdf' ],
  [ 'qwe' ],
  [ 'rty', 'aaa' ],
  [ 'xcv', 'aaa', 'bcd' ] ]

Which makes sense, because apparently javascript array comparison joins the elements with a comma delimiter and then does a string comparison.

PROPOSED SOLUTION

The array comparison logic in mongodb is a mystery to me. But, that opens up a possibility where you might not care about mongodb's mysterious array comparison logic. If all you want is a stable sort so that you can skip and limit for pagination, then I think I have a solution for you.

If we create an index on the first value of the array, like so (using background:1 to avoid locking the database):

db.mycoll.createIndex( { "someobject.arrayfield.0":1 }, {background:1} )

Then we can perform the find query and sort on the first object in the array, which will avoid the SORT stage:

mongos> db.mycoll.find().sort({"someobject.arrayfield.0":1}).explain()

"winningPlan" : {
   "stage" : "LIMIT",
   "limitAmount" : 1,
   "inputStage" : {
      "stage" : "SKIP",
      "skipAmount" : 1,
      "inputStage" : {
         "stage" : "FETCH",
         "inputStage" : {
            "stage" : "IXSCAN",
            "keyPattern" : {
               "someobject.arrayfield.0" : 1
            },
            "indexName" : "someobject.arrayfield.0_1",
            "isMultiKey" : false,
            "multiKeyPaths" : {
               "someobject.arrayfield.0" : [ ]
            },
            "isUnique" : false,
            "isSparse" : false,
            "isPartial" : false,
            "indexVersion" : 2,
            "direction" : "forward",
            "indexBounds" : {
               "someobject.arrayfield.0" : [
                  "[MinKey, MaxKey]"
               ]
            }
         }
      }
   }
}

No more SORT stage!


This proposed solution is based on a big assumption that you are willing to accept a different sort order than your original query was providing. I hope that this solution will work and you are able to implement it this way. If not, maybe someone else can expand on this idea.

RovingBlade
  • 226
  • 1
  • 5
  • Thanks for the anwser. Unfortunately there is nothing new for me, I came to the same possible solution and I briefly mentioned it in the end of my question. – Vlad Mamaev Jan 11 '19 at 10:18
  • Whoops, sorry for repeating something you already knew! The only other idea I've had is to pre-calculate the sort order of the array to a float and store it in another field. You would have to keep that field in sync as the array values change, but then you could perform the sort on it. – RovingBlade Jan 11 '19 at 16:57