
I have a collection in MongoDB which stores clickstream data for each day. The documents have a structure like this -

{"utcDate": ISODate(date1), "userToken":"user-id1", ..}
{"utcDate": ISODate(date1), "userToken":"user-id2", ..}
{"utcDate": ISODate(date2), "userToken":"user-id1", ..}
{"utcDate": ISODate(date2), "userToken":"user-id2", ..}

I am trying to get daily active users within a certain date range. This is my current query -

[
  {
    "$project": {
      "utcDate~~~day": {
        "$let": {
          "vars": {
            "column": "$utcDate"
          },
          "in": {
            "___date": {
              "$dateToString": {
                "format": "%Y-%m-%d",
                "date": "$$column"
              }
            }
          }
        }
      },
      "userToken": "$userToken"
    }
  },
  {
    "$match": {
      "utcDate~~~day": {
        "$gte": {
          "___date": "2019-04-01"
        },
        "$lte": {
          "___date": "2019-04-08"
        }
      }
    }
  },
  {
    "$project": {
      "_id": "$_id",
      "___group": {
        "utcDate~~~day": "$utcDate~~~day"
      },
      "userToken": "$userToken"
    }
  },
  {
    "$group": {
      "_id": "$___group",
      "count": {
        "$addToSet": "$userToken"
      }
    }
  },
  {
    "$sort": {
      "_id": 1
    }
  },
  {
    "$project": {
      "_id": false,
      "utcDate~~~day": "$_id.utcDate~~~day",
      "count": {
        "$size": "$count"
      }
    }
  },
  {
    "$sort": {
      "utcDate~~~day": -1
    }
  }
]

How do I optimize this query? I currently have separate indexes on `utcDate` and `userToken`. I read that compound indexes help with this; what should my index look like?

These are my current indexes -

[
    {
        "v" : 2,
        "key" : {
            "userToken" : 1
        },
        "name" : "userToken_1",
        "ns" : "events.user_events"
    },
    {
        "v" : 2,
        "key" : {
            "utcDate" : 1
        },
        "name" : "utcDate_1",
        "ns" : "events.user_events"
    }
]
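
If a compound index is the goal, one plausible shape (an untested sketch in mongo shell; `user_events` is the collection name taken from the `ns` fields above) puts the range-filtered date field first and the grouped field second:

```
db.user_events.createIndex({ "utcDate": 1, "userToken": 1 })
```

With `utcDate` leading, a date-range `$match` can use the index; including `userToken` lets the index also supply the field being counted, instead of fetching full documents.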
  • So you have several concepts horribly incorrect. 1. **Always** match as the **first** pipeline stage. This is the only place you can use an index and you also *should* have an index on the `utcDate` field if you want to query on it. Big optimization there. 2. Your main problem is using `$project` in the first stage to "concatenate" dates. This is incorrect. Use date ranges as shown in the initial link. i.e. `new Date("2019-04-01")` matches the existing BSON dates as stored with no conversion. 3. `$dateToString` is a horrible hack. Grouping date intervals can be done with math and other methods. – Neil Lunn Apr 08 '19 at 08:55
  • The most performant "date rounding" is actually via either `$dateFromParts` ( helped by other date aggregation operators for the "parts" ) or the pure mathematical approach as [demonstrated in the final example.](https://stackoverflow.com/a/26814496/2313887) – Neil Lunn Apr 08 '19 at 08:57
  • @NeilLunn Ah, thank you! I actually used an analytics tool called Metabase, and copied the query directly from there, but it was running very slow so thought something must be wrong. Will check your solutions, thanks! – n00b Apr 08 '19 at 09:02
  • Hmmn. Well it does indeed *"look like"* ( or *"smells like"* ) a rather **literal translation** of a SQL dialect into MongoDB aggregation framework parlance. You do such "horrible" things of course with stuff like SQLite for similar "date aggregation" processes. And of course the `SELECT` in SQL is analogous to `$project` in many cases, with the clear exception as noted that an "aggregation pipeline" really wants `$match` before any `$project`, *IF* you even need `$project` at all. I suggest you do not, as `$group` basically does "field selection" for your output. – Neil Lunn Apr 08 '19 at 09:08
  • Ah yes, that makes more sense – n00b Apr 08 '19 at 09:28
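
Putting the first comment's advice together, a reordered pipeline might look like the following (an untested sketch; it assumes `utcDate` is stored as a BSON Date, and keeps `$dateToString` only to produce readable day keys):

```javascript
// Sketch of the reordered pipeline: $match first so an index on
// utcDate can be used, with native Date bounds (no string conversion).
// Note the exclusive upper bound: $lt April 9 covers all of April 8.
const pipeline = [
  { "$match": {
      "utcDate": { "$gte": new Date("2019-04-01"), "$lt": new Date("2019-04-09") }
  } },
  // Group straight into per-day user sets; no intermediate $project needed.
  { "$group": {
      "_id": { "$dateToString": { "format": "%Y-%m-%d", "date": "$utcDate" } },
      "users": { "$addToSet": "$userToken" }
  } },
  // Emit the day and the distinct-user count, newest day first.
  { "$project": { "_id": 0, "day": "$_id", "count": { "$size": "$users" } } },
  { "$sort": { "day": -1 } }
];
```

Compared to the original, this drops both `$project` stages before the `$group` and the redundant first `$sort`, so only the documents inside the date range are ever touched.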
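
The "pure mathematical" day rounding mentioned in the second comment boils down to subtracting the timestamp's remainder modulo one day. A small sketch of that arithmetic in plain JavaScript (server-side, the equivalent is expressed with `$subtract` and `$mod` on the BSON date, as in the linked answer):

```javascript
const MS_PER_DAY = 1000 * 60 * 60 * 24;

// Round a Date down to its UTC day boundary via modulo arithmetic.
// In an aggregation pipeline the analogous expression is roughly:
//   { "$subtract": [ "$utcDate",
//       { "$mod": [ { "$subtract": [ "$utcDate", new Date(0) ] }, MS_PER_DAY ] } ] }
function startOfUtcDay(date) {
  return new Date(date.getTime() - (date.getTime() % MS_PER_DAY));
}
```

Grouping on a value like this yields a BSON Date per day rather than a formatted string, which is cheaper to compute and sorts chronologically by default.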

0 Answers