1

I have a collection that has a createdAt field in it. When I get multiple documents from this collection, I use this field to sort them from the newest to the oldest. The thing is, that multiple documents can somehow be created at the same millisecond and therefore have the exact same date. Because of that, I need my sorting to have some tie-breaker for those situations (I don't care what order they will be, but it has to be consistent). I tried using the _id for that, and sorting using {createdAt: -1, _id: -1}, but it creates problems with the use of indexes with some queries, making some very common queries slower and inefficient.

Then, I thought that instead of saving createdAt as a Date object, I can just save it as a string (ISO 8601) and then add a random string at the end. That way even if multiple documents have the same Date, thanks to the random string, they will still have a consistent order. It's like a built-in tie-breaker that makes it very simple to work with sorting or getting all of the documents that were created before some other document (something that I use a lot for pagination). The only thing that it adds to my code is that when I want to get the actual date, I need to remove that random string and convert it to Date, but this is very easily done with new Date(createdAt.split('Z')[0] + 'Z').

Does this approach has any downsides? Something that I need to consider?

R7B9
  • 13
  • 1
  • 5
  • 1
    "but it creates problems with the use of indexes with some queries" ... What sort of problems? Do you have an index on `{createdAt: -1, _id: -1}`? – rickhg12hs Oct 13 '22 at 18:22
  • @rickhg12hs Yes I have an index. The problem is when trying to get documents that were created before another document, because I have to use $or (either the date was before the document date, or the two dates are equal and the _id is smaller). Because of that, mongo has to do MERGE_SORT (without using the index). The problem is even bigger when trying to also search with another field. Even if I have a compound index for all three fields, it can't both do an index scan with that compound index, and also sort by only the 2 of them (the third field is an array, so I can't add it to the sort). – R7B9 Oct 13 '22 at 18:44
  • `MERGE_SORT` implies that the database _is_ using the index to gather the data in sorted order (and just merging or "zipping" the two sorted streams together). Can you upload a verbose explain (`.explain("allPlansExecution")`) that demonstrates the problem? In fairness, I do usually recommend against storing dates as strings, but I do want to be sure that we are looking at the right problem here before trying to providing solutions for the wrong thing. – user20042973 Oct 13 '22 at 19:35
  • Does this answer your question? [Difference between "id" and "\_id" fields in MongoDB](https://stackoverflow.com/questions/9694460/difference-between-id-and-id-fields-in-mongodb). The generation of ObjectId actually encapsulate the creation time inside. You can use `$toDate` to get the time if needed. And `_id` field is actually indexed, so performance should be a less concern here. – ray Oct 13 '22 at 20:02
  • @user20042973 How should I upload it? I tried editing the post and add it but it's too long so StackOverflow doesn't let me... – R7B9 Oct 13 '22 at 20:19
  • @ray thanks but it's not really my question... I want to sort with the createdAt field, not the _id field, and it is not only the sorting but also I want to get all of the documents that were created before some given document (for pagination). Actually, I don't even use ObjectId for the _id but a string that I generate (mainly because I just want shorter ids) – R7B9 Oct 13 '22 at 20:24
  • Perhaps create a gist on github or use something like pastebin.com? – user20042973 Oct 13 '22 at 20:49
  • Storing date values as string is usually a very bad idea, actually you should **never** do it. Like for others it is really not clear why sort by `{createdAt: -1, _id: -1}` should give any problems. – Wernfried Domscheit Oct 14 '22 at 06:25
  • @user20042973 here is the result of this query: `db.questions.find({tags: "some_tag", $or: [{createdAt: {$lt: ISODate('2022-10-12T17:39:49.774Z')}}, {createdAt: ISODate('2022-10-12T17:39:49.774Z'), _id: 'F3zcf4Ll0R'}]}).limit(20).sort({createdAt: -1, _id: -1}).explain("allPlansExecution")` [link](https://pastebin.com/VtZrh1Lu) I have both index for createdAt and _id, and an index for tags (array), createdAt and _id. – R7B9 Oct 14 '22 at 10:26
  • @WernfriedDomscheit as I said the problem is mostly when having another field to search in, I shared here the result of the .explain of some query for example. – R7B9 Oct 14 '22 at 10:30

2 Answers2

0

Sounds like you might want to consider using Universally Unique Lexicographically Sortable Identifier (ULID).

It will create a random ID, based on the timestamp. It encodes the timestamp and appends a random part to avoid conflicts for the exact same date.

const {ulid} = require('ulid');

ulid(new Date('2022-01-01').valueOf()); // 01FR9EZ700AT3P6EA9PW5K9NQ8
ulid(new Date('2022-01-01').valueOf()); // 01FR9EZ700NQ1RP84E34S1W6X1

Note, if I generate a ulid for the same data again, the first 10 bytes of the string will remain the same, the rest will be changed only. This is because of the ULID format:

 01AN4Z07BY      79KA1307SR9X4MV3

|----------|    |----------------|
 Timestamp          Randomness
   48bits             80bits

So in your case, instead of having a compound index for {createdAt: -1, _id:1}, you'll have a simple index based on 1 field that stores a ULID.

But wait, how do you search based on some date then? How do you decode the original date back?

Searching documents larger than a date:

const {encodeTime} = require('ulid');

const datePrefix = encodeTime(new Date('2022-01-01').valueOf(), 10);

db.collection.find({ulidField: {$gte: datePrefix}})

Returning the original date back:

const {decodeTime} = require('ulid');

new Date(decodeTime('01FR9EZ700AT3P6EA9PW5K9NQ8')); // Sat Jan 01 2022 01:00:00 GMT+0100 (Central European Standard Time)
Vlad Holubiev
  • 4,876
  • 7
  • 44
  • 59
  • What are the benefits of using it instead of just saving ISO date + random string? It seems to me like it only has downsides: less readability and more storage space. Am I missing something? – R7B9 Oct 13 '22 at 19:54
  • @R7B9 index size will be smaller. As you don't need to have a compound index for _id+date – Vlad Holubiev Oct 14 '22 at 12:56
  • No, I meant that I'll have one field that will save a string that combines the date in ISO format + some random string. So for example the string could look something like this: `'2022-10-14T13:03:36.042ZbL6h3'`. What is the benefit of using a library to encode it when I can easily create it myself with more readble and shorter format? – R7B9 Oct 14 '22 at 13:05
0

The index bounds are not optimal, as you see in the explain plan:

indexBounds: {
    tags: [ '["some_tag", "some_tag"]' ],
    createdAt: [ '[MaxKey, MinKey]' ],
    _id: [ '[MaxKey, MinKey]' ]
}

Rewrite the query and try this one:

db.questions.find(
   {
      $or: [
         { tags: "some_tag", createdAt: { $lt: ISODate('2022-10-12T17:39:49.774Z') } },
         { tags: "some_tag", createdAt: ISODate('2022-10-12T17:39:49.774Z'), _id: 'F3zcf4Ll0R' }
      ]
   }
).limit(20).sort({ createdAt: -1, _id: -1 }).explain("allPlansExecution")

When you create index as db.questions.createIndex({ tags: 1, createdAt: -1 }) then you get this plan:

{
  stage: 'OR',
  inputStages: [
    {
      stage: 'FETCH',
      filter: {
        '$and': [
          { createdAt: { '$eq': ISODate("2022-10-12T17:39:49.774Z") } },
          { tags: { '$eq': 'some_tag' } }
        ]
      },
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: { _id: 1 },
        indexName: '_id_',
        isMultiKey: false,
        multiKeyPaths: { _id: [] },
        isUnique: true,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { _id: [ '["F3zcf4Ll0R", "F3zcf4Ll0R"]' ] }
      }
    },
    {
      stage: 'IXSCAN',
      keyPattern: { tags: 1, createdAt: -1 },
      indexName: 'tags_1_createdAt_-1',
      isMultiKey: false,
      multiKeyPaths: { tags: [], createdAt: [] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: {
        tags: [ '["some_tag", "some_tag"]' ],
        createdAt: [ '(new Date(1665596389774), new Date(-9223372036854775808)]' ]
      }
    }
  ]
}

Field _id is always unique (and always has an index), so it makes sense for MongoDB to query on this field and run OR with the second condition.

Thus, you could also write:

db.questions.find(
   {
      $or: [
         { tags: "some_tag", createdAt: { $lt: ISODate('2022-10-12T17:39:49.774Z') } },
         { _id: 'F3zcf4Ll0R' }
      ]
   }
).limit(20).sort({ createdAt: -1, _id: -1 }).explain("allPlansExecution")
Wernfried Domscheit
  • 54,457
  • 9
  • 76
  • 110
  • Thank you so much! I'll try to put the tag search inside each side of the or. I now see that I also made a mistake when sending here the query... Sorry, the second condition inside the $or should be `{createdAt: ISODate('2022-10-12T17:39:49.774Z'), _id: {$lt: 'F3zcf4Ll0R'}}`. Will fix those two. – R7B9 Oct 14 '22 at 14:45
  • Thanks for walking this one forward @Wernfried Domscheit. My only additional comment is that the index definition should have `_id: -1` appended to it. This will allow the database to generate a `SORT_MERGE` plan rather than an `OR` with a blocking `SORT` – user20042973 Oct 14 '22 at 15:08
  • If the proposed solution works, then please accept the answer. – Wernfried Domscheit Oct 14 '22 at 15:08