0

I have a ratings table, where each user can add one rating a day. But each user might miss several days between ratings.

I'd like to get the average rating for each user_id's first 7 entries of created_at.

My table:

mysql> desc entries;
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| id         | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| rating     | tinyint(4)       | NO   |     | NULL    |                |
| user_id    | int(10) unsigned | NO   | MUL | NULL    |                |
| created_at | timestamp        | YES  |     | NULL    |                |
+------------+------------------+------+-----+---------+----------------+

Ideally I'd just get something like:

+------------+------------------+
| day        | average_rating   |
+------------+------------------+
| 1          | 2.53             |
+------------+------------------+
| 2          | 4.30             |
+------------+------------------+
| 3          | 3.67             |
+------------+------------------+
| 4          | 5.50             |
+------------+------------------+
| 5          | 7.23             |
+------------+------------------+
| 6          | 6.98             |
+------------+------------------+
| 7          | 7.22             |
+------------+------------------+

The closest I've been able to get is:

SELECT rating, user_id, created_at FROM entries ORDER BY user_id asc, created at desc

Which isn't very close at all...

Is it even possible? Will the performance be terrible? It's something that would need to run every time a web page is loaded, so would it be better to just run this once a day and save the results? (to another table!?)

edit - second attempt

Working towards a solution, I think this would get the rating for each user's first day:

select rating from entries where user_id in
    (select user_id from entries order by created_at limit 1);

But I get:

ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

So now I'm going to play around with JOIN to see if that helps.

edit - third attempt, getting closer

I found this stackoverflow post, which is closer to what I want.

select e1.* from entries e1 left join entries e2 
on (e1.user_id = e2.user_id and e1.created_at > e2.created_at) 
where e2.id is null;

It gets the rating for the first day for each user.

Next step is to work out how to get days 2 to 7. I can't use 1.created_at > e2.created_at for that, so I'm really confused now.

edit - fourth attempt

Okay, I think it's not possible. Once I worked out how to turn off 'full group by' mode, I realised I'll probably need to use a subquery with limit <user_id>, <day_num>, for which I get:

ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

My current method is to just get the entire table, and use PHP to calculate the average for each day.

Community
  • 1
  • 1
david_nash
  • 678
  • 7
  • 14
  • 2
    You're going to need row number here, as well as a calendar table to fill in the missing days for a given user, should they occur. – Tim Biegeleisen Feb 14 '17 at 04:32
  • Okay thanks - you're saying add a day_number column to the table? Not sure what you mean by the calendar table. If user1 first rates on Jan 1 and next Jan 2, and user2 rates first on March 1 and then August 1, I need the average rating for all users first and second days, regardless of when they occur. – david_nash Feb 14 '17 at 06:23
  • The missing dates have to come from _somewhere_, since they're not in your original table. And the row numbers also have to come from somewhere. – Tim Biegeleisen Feb 14 '17 at 06:24
  • I'm still not sure I understand. If a user signs 5 years after the first user, do I need to create a row for the missing dates for every day the user missed? Those missing dates aren't going to effect the average for "Day N" of all users. I would think that MySQL would be able to look at the first rating for all users, and tell me the average rating without needing a separate field for missing dates or the day_number (which would always be 1 for the earliest rating date, and N for the Nth rating date.). Maybe I'm not able to explain this clearly because I don't understand it well enough :) – david_nash Feb 14 '17 at 10:10
  • In order to calculate and assign averages for the first N days, you first need to have records representing each day. Note that if on a certain day, e.g. Christmas, no users had any activity, then there is NO way to report ANYTHING for this day. Hence, you need a calendar table. – Tim Biegeleisen Feb 14 '17 at 10:19
  • Okay thanks Tim. I'm not interested in the calendar. The result I want is a graph showing how user's ratings change over the first 7 days of their use of the product. The idea is, they start off low and by the end they love it. Regardless of when any user starts or finishes. The graph won't be labelled with dates, it'll be labelled with "Day 1" etc (see my ideal results table). Hope this makes sense! And thanks for your working this through with me. – david_nash Feb 14 '17 at 19:22

1 Answers1

1

If I understand correctly you want to take the last 7 ratings the user gave, ordered by the date they gave the rating. The last 7 ratings of one user may fall on different days to another user, however they will be averaged together regardless of date.

First we need to order the data by user and date and give each user their own incrementing row count. I do this by adding two variables, one for the last user id and one for the row number:

select e.created_at, 
  e.rating, 
  if(@lastUser=user_id,@row := @row+1, @row:=1) as row, 
  @lastUser:= e.user_id as user_id 
from entries e, 
  ( select @row := 0, @lastUser := 0 ) vars 
order by e.user_id asc, 
  e.created_at desc;

If the previous user_id is different we reset the row counter to 1. The result from this is:

+---------------------+--------+------+---------+
| created_at          | rating | row  | user_id |
+---------------------+--------+------+---------+
| 2017-01-10 00:00:00 |      1 |    1 |       1 |
| 2017-01-09 00:00:00 |      1 |    2 |       1 |
| 2017-01-08 00:00:00 |      1 |    3 |       1 |
| 2017-01-07 00:00:00 |      1 |    4 |       1 |
| 2017-01-06 00:00:00 |      1 |    5 |       1 |
| 2017-01-05 00:00:00 |      1 |    6 |       1 |
| 2017-01-04 00:00:00 |      1 |    7 |       1 |
| 2017-01-03 00:00:00 |      1 |    8 |       1 |
| 2017-01-02 00:00:00 |      1 |    9 |       1 |
| 2017-01-01 00:00:00 |      1 |   10 |       1 |
| 2017-01-13 00:00:00 |      1 |    1 |       2 |
| 2017-01-11 00:00:00 |      1 |    2 |       2 |
| 2017-01-09 00:00:00 |      1 |    3 |       2 |
| 2017-01-07 00:00:00 |      1 |    4 |       2 |
| 2017-01-05 00:00:00 |      1 |    5 |       2 |
| 2017-01-03 00:00:00 |      1 |    6 |       2 |
| 2017-01-01 00:00:00 |      1 |    7 |       2 |
| 2017-01-13 00:00:00 |      1 |    1 |       3 |
| 2017-01-01 00:00:00 |      1 |    2 |       3 |
| 2017-01-03 00:00:00 |      1 |    1 |       4 |
| 2017-01-01 00:00:00 |      1 |    2 |       4 |
| 2017-01-02 00:00:00 |      1 |    1 |       5 |
+---------------------+--------+------+---------+

We now simply wrap this in another statement to select the avg where the row number is less than or equal to seven.

select e1.row day, avg(e1.rating) avg
from (
  select e.created_at,
    e.rating,
    if(@lastUser=user_id,@row := @row+1, @row:=1) as row,
    @lastUser:= e.user_id as user_id
  from entries e,
    ( select @row    := 0, @lastUser := 0 ) vars
  order by e.user_id asc,
    e.created_at desc) e1
where e1.row <=7 
group by e1.row;

This outputs:

+------+--------+
| day  | avg    |
+------+--------+
|    1 | 1.0000 |
|    2 | 1.0000 |
|    3 | 1.0000 |
|    4 | 1.0000 |
|    5 | 1.0000 |
|    6 | 1.0000 |
|    7 | 1.0000 |
+------+--------+
roblovelock
  • 1,971
  • 2
  • 23
  • 41
  • You sir are a magician! Thank you a thousand times for this, perfect answer and very clearly explained. – david_nash Feb 16 '17 at 03:11
  • I just had to make one tiny change for this to work correctly - on the third line from the bottom where you have `e.created_at desc` - I changed that to `e.created_at asc` - but thanks again! I'm still so impressed by this :) – david_nash Feb 16 '17 at 05:50
  • I wasn't sure if you needed the first or last 7 days. But as you found just changing the sort order gives the correct answer. Glad I could help – roblovelock Feb 16 '17 at 10:00