I have an events
table that has an user_id
and created_at
columns.
What I'd like to figure out is the percentage of users having day-over-day repeat entries in the events
table.
So, if day d1
has user_id
s [a,b,c]
, and day d2
has user_id
s [b,d,e]
, then b
is the only repeat user_id
of the three (from d1
), and there is 33% overlap between d1
and d2
.
I'd like to be able to extend this for an arbitrary number of days.
The schema in question:
CREATE TABLE events (
events_id serial PRIMARY KEY
, user_id VARCHAR(255) NOT NULL
, created_at datetime NOT NULL
);
This is a large table, having ~25MM rows per day, with ~4.5MM distinct user_id
s per day.
Example data set:
+---------+---------------------+ | user_id | created_at | +---------+---------------------+ | bob | 2014-12-02 11:11:11 | | sally | 2014-12-02 12:12:11 | | zed | 2014-12-02 12:22:11 | | | ... | | chris | 2014-12-03 11:13:11 | | mark | 2014-12-03 11:11:13 | | zed | 2014-12-03 11:11:33 | | | ... | | sydney | 2014-12-04 11:14:11 | | zed | 2014-12-04 11:44:11 | | chris | 2014-12-04 11:44:11 | | | ... | | sydney | 2014-12-05 11:15:11 | | zed | 2014-12-05 11:55:11 | | chris | 2014-12-05 11:55:15 | | sandy | 2014-12-05 11:55:51 | | sydney | 2014-12-05 11:55:55 | +---------+---------------------+
Expected output:
+------------+---------------------------+ | day | returning_user_percentage | +------------+---------------------------+ | 2014-12-02 | NULL | | 2014-12-03 | 33 | | 2014-12-04 | 66 | | 2014-12-05 | 75 | +------------+---------------------------+
Additionally, and quite probably far more simple, part 2: I'd like to know how many new users there are each day, where "new" means the user_id
has not previously been seen.