I have an events table with user_id and created_at columns.

What I'd like to figure out is the percentage of users having day-over-day repeat entries in the events table.

So, if day d1 has user_ids [a,b,c], and day d2 has user_ids [b,d,e], then b is the only repeat user_id of the three (from d1), and there is 33% overlap between d1 and d2.

I'd like to be able to extend this for an arbitrary number of days.

The schema in question:

CREATE TABLE events (
  events_id serial PRIMARY KEY
, user_id VARCHAR(255) NOT NULL
, created_at datetime NOT NULL
);

This is a large table, having ~25MM rows per day, with ~4.5MM distinct user_ids per day.

Example data set:

+---------+---------------------+
| user_id |     created_at      |
+---------+---------------------+
| bob     | 2014-12-02 11:11:11 |
| sally   | 2014-12-02 12:12:11 |
| zed     | 2014-12-02 12:22:11 |
|         | ...                 |
| chris   | 2014-12-03 11:13:11 |
| mark    | 2014-12-03 11:11:13 |
| zed     | 2014-12-03 11:11:33 |
|         | ...                 |
| sydney  | 2014-12-04 11:14:11 |
| zed     | 2014-12-04 11:44:11 |
| chris   | 2014-12-04 11:44:11 |
|         | ...                 |
| sydney  | 2014-12-05 11:15:11 |
| zed     | 2014-12-05 11:55:11 |
| chris   | 2014-12-05 11:55:15 |
| sandy   | 2014-12-05 11:55:51 |
| sydney  | 2014-12-05 11:55:55 |
+---------+---------------------+

Expected output:

+------------+---------------------------+
|    day     | returning_user_percentage |
+------------+---------------------------+
| 2014-12-02 | NULL                      |
| 2014-12-03 | 33                        |
| 2014-12-04 | 66                        |
| 2014-12-05 | 75                        |
+------------+---------------------------+

Additionally, and quite probably far simpler, part 2: I'd like to know how many new users there are each day, where "new" means the user_id has not previously been seen.
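
For illustration, an untested sketch of the kind of query I imagine for part 2, counting each user_id on its first-ever day:

SELECT first_day AS day, count(*) AS new_users
FROM  (
   SELECT user_id, min(created_at::date) AS first_day
   FROM   events
   GROUP  BY user_id
   ) u
GROUP  BY 1
ORDER  BY 1;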

findchris
  • The question is not entirely clear, and either way it would be essential to know whether users can have multiple events per day. A ***table definition*** would clarify that (among other things). – Erwin Brandstetter Dec 02 '14 at 05:57
  • Updated question with much more detail. Thanks. – findchris Dec 02 '14 at 20:52
  • 1
    (1) I believe that is what my answer does; (2) You should stick to one question in a question. If you want to ask about new customers, ask another question. – Gordon Linoff Dec 02 '14 at 21:44
  • So there can be any number of entries for the same user for a single day? How do you calculate **60 %** in the last line exactly? **100 %** of users from 12-04 appear 12-05, **75 %** of the users 12-05 already appeared 12-04 and **80 %** of the *entries* are with users that appeared the previous day. Either way, it doesn't add up for me. – Erwin Brandstetter Dec 02 '14 at 21:56
  • @ErwinBrandstetter, good catch, and my apologies: 75% is correct, and I've updated accordingly. 3 of the 4 user_ids from 12-04 were seen on 12-05. And yes: There can be multiple entries for the same user for a single day. – findchris Dec 02 '14 at 23:04

2 Answers

Assuming that created_at is a date with no time component:

select e.created_at,
       avg(case when eprev.user_id is not null then 1.0 else 0.0 end) as overlap
from events e
     left join events eprev
       on  e.created_at = eprev.created_at + interval '1' day
       and e.user_id = eprev.user_id
group by e.created_at;
Gordon Linoff
  • This doesn't count a "percentage of users having day-over-day repeat entries" so much as the percentage of "day-over-day repeat entries" per day - which is something completely different. Quite possible that the OP really wants that (the overlap he mentions points that way), but he stated differently. – Erwin Brandstetter Dec 02 '14 at 05:51
  • Thanks for the response. See updated question above. – findchris Dec 02 '14 at 20:52
  • If you want values from 0 to 100, change `1.0` to `100.0`. – Gordon Linoff Dec 02 '14 at 21:45

Answer to updated question:

"Calculate for every day the percentage of distinct users that have entries for the previous day as well."

WITH e AS (SELECT created_at::date AS day, user_id FROM events GROUP BY 1, 2)
SELECT e.day
     , round(100.0 * count(e1.user_id) / count(*), 2) AS pct_repeat_user
FROM   e
LEFT   JOIN e e1 ON e1.user_id = e.user_id
                AND e1.day = e.day - 1
GROUP  BY 1
ORDER  BY 1;

Returns your desired result, except with 0 instead of NULL for the first day, which seems more correct to me.
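
If NULL for the first day is a hard requirement, an untested sketch that blanks out the earliest day in the data:

WITH e AS (SELECT created_at::date AS day, user_id FROM events GROUP BY 1, 2)
SELECT e.day
     , CASE WHEN e.day = (SELECT min(day) FROM e)
            THEN NULL
            ELSE round(100.0 * count(e1.user_id) / count(*), 2)
       END AS pct_repeat_user
FROM   e
LEFT   JOIN e e1 ON e1.user_id = e.user_id
                AND e1.day = e.day - 1
GROUP  BY 1
ORDER  BY 1;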

Explanation

  • Your "date" is actually a timestamp (Postgres has no datetime type). To get unique users per day, I cast to date and group users in the CTE e. This is crucial, or you get nonsensical results from cross-joining duplicate users on one day with duplicate users on the previous day.

  • Building on that, LEFT JOIN to the previous day. Count users that can be found "yesterday" as well and divide by the number of users "today". Multiply by 100.0 to coerce the number to numeric before dividing and use round() to get your desired number of fractional digits.

  • Since day is of data type date now, you can simply subtract the integer 1 from it to get "yesterday".

  • Be aware that a "day" is also defined by the time zone, in case you have data from multiple time zones. This does not act up as long as you just store timestamp, but there may be an inherent error in your data; see the sketch below.
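
For illustration, a hypothetical sketch of zone-aware bucketing, assuming (contrary to the posted schema) that created_at were timestamptz; the zone names are arbitrary examples:

-- hypothetical: with a timestamptz column, the calendar "day"
-- depends on the zone it is evaluated in
SELECT (created_at AT TIME ZONE 'America/New_York')::date AS day_ny
     , (created_at AT TIME ZONE 'UTC')::date              AS day_utc
FROM   events
LIMIT  10;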

SQL Fiddle.

DB design

For millions of rows and multiple entries per user, I would urgently advise creating a separate users table and referencing it, to reduce disk space and improve performance:

CREATE TABLE users (
  user_id serial PRIMARY KEY
, username text NOT NULL
);

CREATE TABLE events (
  events_id serial PRIMARY KEY
, user_id int NOT NULL REFERENCES users
, created_at timestamp NOT NULL
);
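
Hypothetical usage, with invented sample values: insert a user once, then reference the generated int key from events:

-- register the user once; RETURNING hands back the generated key
INSERT INTO users (username) VALUES ('zed') RETURNING user_id;

-- log events against the small int key instead of repeating the varchar
INSERT INTO events (user_id, created_at)
SELECT user_id, timestamp '2014-12-02 12:22:11'
FROM   users
WHERE  username = 'zed';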

This would also be instrumental in speeding up a number of other queries.

Erwin Brandstetter
  • Thanks Erwin. In your first query, shouldn't the `pct_repeat_user` subquery be bounded by day, else I'll get a distinct count of all `user_id`s in the table. In the second query, unfortunately, I do not have a `users` table that I can join against. I'm still playing with your query to see if it gives the results I want. – findchris Dec 02 '14 at 20:23
  • @findchris: Consider the new answer for your clarified question. – Erwin Brandstetter Dec 02 '14 at 23:43
  • This looks good @Erwin. Thank you. I had thought I'd have to use something more complicated, such as `RANK` and `LAG`, but your solution is nice and simple. – findchris Dec 03 '14 at 19:46