Group by day from timestamp with multiple tables

Question

I have two tables with timestamp columns.

I want to group the result by days. For example: from 2014/06/10 to 2014/06/13. Doesn't matter if there are records between those dates, I want it to group it by days.

My tables:

profits:

enter image description here

profits_referrals:

enter image description here

The result I want:

╔═════════════╦══════════════╦══════╗
║    date     ║  .........   ║  ..  ║
╠═════════════╬══════════════╬══════╣
║  2014/06/10 ║  .........   ║  ..  ║
║  2014/06/11 ║  .........   ║  ..  ║
║  2014/06/12 ║  .........   ║  ..  ║
║  2014/06/13 ║  .........   ║  ..  ║
╚═════════════╩══════════════╩══════╝

A note: also if there are no records from some date, I still want it to show that date, and the amount will be 0.

What I did so far:

SELECT      SUM(`profits`.`amount`) AS `profAmount`,
            COUNT(`profits`.`amount`) AS `profCount`,

            SUM(`profits_referrals`.`amount`) AS `refAmount`,
            COUNT(`profits_referrals`.`amount`) AS `refCount`,

            DATE(FROM_UNIXTIME(`profits`.`date`)) AS `profDate`,
            DATE(FROM_UNIXTIME(`profits_referrals`.`date`)) AS `refDate`
FROM        `profits`
JOIN        `profits_referrals`
ON          `profits`.`userid` = `profits_referrals`.`referral`
WHERE       `profits`.`userid` = " . (int)$user->id . "
GROUP BY    DATE(FROM_UNIXTIME(`profits`.`date`)), DATE(FROM_UNIXTIME(`profits_referrals`.`date`))
ORDER BY    `profDate`
DESC

Result: (I did some PHP code to display it)

Date              Sales              Referrals      Total
2014-04-28    2 / $7.35 USD     2 / $1.4 USD    $8.75 USD
2014-04-28    2 / $7.35 USD     2 / $1.4 USD    $8.75 USD
2014-03-27    1 / $2.10 USD     1 / $0.7 USD    $2.80 USD
2014-03-27    1 / $2.10 USD     1 / $0.7 USD    $2.80 USD
2014-03-25    3 / $6.30 USD     3 / $2.0 USD    $8.40 USD
2014-03-25    3 / $6.30 USD     3 / $2.0 USD    $8.40 USD

What have you got so far? Note that you'll have the easiest time if you have a pre-generated calendar table (which are one of the most useful "dimension"/analysis utility-tables). — Clockwork-Muse, Jun 15 '14 at 13:09
@Clockwork-Muse I didn't really understand your solution, can you explain it? — HTMHell, Jun 16 '14 at 10:01
Regardless of the way I see a solution, what have you already tried? In particular, what calculations are you trying to run over your rows? Why do you need both tables? — Clockwork-Muse, Jun 16 '14 at 10:05
@Clockwork-Muse I want to show reports from every day: how much the user have earned (from `profits`, from `profits_referrals` and total). I updated my question, please look at the code. — HTMHell, Jun 16 '14 at 10:15
Ah, _now_ we can see what your problem is... give me a minute. — Clockwork-Muse, Jun 16 '14 at 10:18

score 2 · Accepted Answer · edited Jul 29 '20 at 18:42

First off, you're joining the two tables by the id, but not by the date, which is generating a temp result set that looks like this:

t1.date        t1.amt   t2.date        t2.amt
'2014-04-28'   1        '2014-03-27'   5
'2014-04-28'   1        '2014-03-25'   6

This is because joins actually say "hey, for every row that matches this condition, put the left and right sides together". The more times rows on the right side match one row on the left side, the more times the left side is repeated. It should be pretty obvious that accurate results will be getting thrown out the window. In order to have (at most) a 1-to-1 relationship, we need to do the aggregate before the join, usually by the use of a subquery;

SELECT ....
FROM {base_table} b
JOIN (SELECT {joinColumn}, {AGGREGATE_FUNCTION}
      FROM {other_table}
      GROUP BY {joinColumn}) o
  ON o.{joinColumn} = b.{joinColumn}

Unfortunately, your dataset doesn't have a canonical "base table" - you aren't guaranteed rows in either table, so something like FULL OUTER JOIN (or the MySQL equivalent) isn't going to work (ie, you would be missing dates if neither table had them). We need to create our own base table.

You need to create what's known as a Calendar Table (this particular one is for SQL Server, but would be adaptable). These are one of the most useful dimension/analysis tables you can make or use. The actual contents are up to you, but for this type of query it fulfills the role of the {base_table}. It's also going to help us get index-access (potentially) for the grouping.

First, the revised subquery:

SELECT Calendar.calendar_date, 
       COUNT(Profits) AS profCount, COALESCE(SUM(Profits.amt), 0) AS profAmount
FROM Calendar
LEFT JOIN Profits
       ON Profits.userId = {desiredUserId}
          AND Profits.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
          AND Profits.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
WHERE Calendar.calendar_date >= {rangeStart}
      AND Calendar.calendar_date < {rangeEnd}

So.
Some things to note here:

I've put descriptions for the parameter values. In reality, you should be using parameterized queries, or you risk SQL Injection. Your current query was safe because of the cast to int, but it's better to not have to worry about it.
Always query positive continuous-range types (everything but an integer count) with an inclusive lower-bound, >=, and an exclusive upper-bound, < (the post is written for SQL Server and timestamps therein, but the problem applies everywhere. Remember that the MySQL DATETIME/TIMESTAMP types have a user-specifiable number of fractional seconds!). For negative ranges, reverse the conditions.
The use of the functions on Calendar.calendar_date (assume this is just a standard DATE type) will prevent the use of indices on the join... from the Calendar side. From the Profits side, it'll have nice singe values to search against. Presumably there are multiple rows in Profits for each calendar day, meaning that's the slow side of the join.

In any case, this will input a temp result set that looks like this:

cal_date       Count   Amount
'2014-06-10'   1       5
'2014-06-11'   0       0
'2014-06-12'   1       -9.5
'2014-06-13'   99      99999999.1

Success; single row per day, pre-aggregated amounts. We can now combine this with the query for the other table (Profits_Referrals), and get our results:

SELECT Profits.Calendar_date,
       Profits.profAmount, Profits.profCount,
       Referrals.refAmount, Referrals.refCount
FROM (SELECT Calendar.calendar_date,
             COUNT(Profits) AS profCount, COALESCE(SUM(Profits.amt), 0) AS profAmount
             FROM Calendar
             LEFT JOIN Profits
                    ON Profits.userId = ?
                       AND Profits.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
                       AND Profits.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
             WHERE Calendar.calendar_date >= ?
                   AND Calendar.calendar_date < ?) Profits
JOIN (SELECT Calendar.calendar_date,
             COUNT(Refferals) AS refCount, COALESCE(SUM(Refferals.amt), 0) AS refAmount
             FROM Calendar
             LEFT JOIN Profits_Referrals Refferals
                    ON Refferals.userId = ?
                       AND Refferals.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
                       AND Refferals.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
             WHERE Calendar.calendar_date >= ?
                   AND Calendar.calendar_date < ?) Refferals
  ON Referrals.calendar_date = Profits.calendar_date
ORDER BY Profits.Calendar_Date

(Remember that the individual subqueries are outputting a row for every date, and everything is already aggregated by that date - we can just join based on the date. This also means we don't need a separate {base_table} here)

Seriously, you want a calendar table. In my opinion, they are **the** most useful analysis table. You can slap as many indices as you want on them (because you almost never write to them), meaning you can have index-based access for queries that would normally have to ignore them (ie, aggregate by day-of-week, ignore first-of-month). For commonly used "derived" columns (like "next day", as I use here), you can create views that output the math, and if performance ends up being a problem, simply materialize them. — Clockwork-Muse, Jun 16 '14 at 11:25

Group by day from timestamp with multiple tables

1 Answers1