I’m using a MongoDB database to keep track of analytics for an application. I’m writing a Clojure application (using clj-time and Monger) to get data out of the database.
I have a collection containing records like
{"_id": ObjectId(...),
timestamp: ISODate("2013-06-01T15:18:37Z"),
device: "04dbf04b6dc0d0a4fd383967b3dc62f50111e07e"}
Each different device
represents a different user of my service. What I’d like to do is to find out how many (unique) users I have each day, but with the caveat that I’d like “day” to refer specifically to the US/Central time zone, taking daylight saving into account. (If that weren’t a requirement I think I could just do something like a $group
and then a distinct
.)
Here’s what I’ve been doing:
(ns analytics.reporting
(:use [monger.core :only [connect! connect set-db! get-db]]
monger.operators
clj-time.core
clj-time.periodic
clj-time.format)
(:require [monger.collection :as mc]))
(defn to-central
[dt]
(from-time-zone dt (time-zone-for-id "America/Chicago")))
(defn count-distinct
[coll]
(count (distinct coll)))
(defn daily-usage
[ndays]
(let [midnights (map to-central
(reverse (for [offset (map days (range ndays))]
(minus (to-central (today-at 0 0)) offset))))
by-day (for [midnight midnights]
(mc/find-maps "devices" {:timestamp {$gte midnight $lt (plus midnight (days 1))}}))
devices-by-day (map #(map :device %) by-day)
distinct-devices-by-day (map count-distinct devices-by-day)]
distinct-devices-by-day))
If you can’t read Clojure, this basically says: get a list of the most recent n midnights in the Central time zone, and then run Mongo queries to find all of the records between each successive pair of midnights. Then, count the number of distinct device
s within each day.
Here’s what I don’t like about this approach:
- Running a separate query for each day (I usually look at 30 days at a time) feels wrong; this is something that should be done on the database side instead of the application side.
- Counting the distinct
device
s should also be done by the database. - My server is set to the UTC time zone, so if it’s after midnight in UTC but before midnight in Central time, the last entry in this list will always be zero. This is easy enough to patch over, but I’d prefer a solution smart enough to prevent it in the first place.
- This entire function takes about 500ms to run. That isn’t awful—I’m the only one who runs the query, and only once or twice per day—but it seems like the operation shouldn’t take that long.
Is there a way I can shove more of this logic into the MongoDB query?