First off, you're joining the two tables by the id
, but not by the date
, which is generating a temp result set that looks like this:
t1.date t1.amt t2.date t2.amt
'2014-04-28' 1 '2014-03-27' 5
'2014-04-28' 1 '2014-03-25' 6
This is because joins actually say "hey, for every row that matches this condition, put the left and right sides together". The more times rows on the right side match one row on the left side, the more times the left side is repeated. It should be pretty obvious that accurate results will be getting thrown out the window. In order to have (at most) a 1-to-1 relationship, we need to do the aggregate before the join, usually by the use of a subquery;
SELECT ....
FROM {base_table} b
JOIN (SELECT {joinColumn}, {AGGREGATE_FUNCTION}
FROM {other_table}
GROUP BY {joinColumn}) o
ON o.{joinColumn} = b.{joinColumn}
Unfortunately, your dataset doesn't have a canonical "base table" - you aren't guaranteed rows in either table, so something like FULL OUTER JOIN
(or the MySQL equivalent) isn't going to work (ie, you would be missing dates if neither table had them). We need to create our own base table.
You need to create what's known as a Calendar Table (this particular one is for SQL Server, but would be adaptable). These are one of the most useful dimension/analysis tables you can make or use. The actual contents are up to you, but for this type of query it fulfills the role of the {base_table}
. It's also going to help us get index-access (potentially) for the grouping.
First, the revised subquery:
SELECT Calendar.calendar_date,
COUNT(Profits) AS profCount, COALESCE(SUM(Profits.amt), 0) AS profAmount
FROM Calendar
LEFT JOIN Profits
ON Profits.userId = {desiredUserId}
AND Profits.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
AND Profits.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
WHERE Calendar.calendar_date >= {rangeStart}
AND Calendar.calendar_date < {rangeEnd}
So.
Some things to note here:
- I've put descriptions for the parameter values. In reality, you should be using parameterized queries, or you risk SQL Injection. Your current query was safe because of the cast to
int
, but it's better to not have to worry about it.
- Always query positive continuous-range types (everything but an integer count) with an inclusive lower-bound,
>=
, and an exclusive upper-bound, <
(the post is written for SQL Server and timestamps therein, but the problem applies everywhere. Remember that the MySQL DATETIME
/TIMESTAMP
types have a user-specifiable number of fractional seconds!). For negative ranges, reverse the conditions.
- The use of the functions on
Calendar.calendar_date
(assume this is just a standard DATE
type) will prevent the use of indices on the join... from the Calendar
side. From the Profits
side, it'll have nice singe values to search against. Presumably there are multiple rows in Profits
for each calendar day, meaning that's the slow side of the join.
In any case, this will input a temp result set that looks like this:
cal_date Count Amount
'2014-06-10' 1 5
'2014-06-11' 0 0
'2014-06-12' 1 -9.5
'2014-06-13' 99 99999999.1
Success; single row per day, pre-aggregated amounts. We can now combine this with the query for the other table (Profits_Referrals
), and get our results:
SELECT Profits.Calendar_date,
Profits.profAmount, Profits.profCount,
Referrals.refAmount, Referrals.refCount
FROM (SELECT Calendar.calendar_date,
COUNT(Profits) AS profCount, COALESCE(SUM(Profits.amt), 0) AS profAmount
FROM Calendar
LEFT JOIN Profits
ON Profits.userId = ?
AND Profits.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
AND Profits.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
WHERE Calendar.calendar_date >= ?
AND Calendar.calendar_date < ?) Profits
JOIN (SELECT Calendar.calendar_date,
COUNT(Refferals) AS refCount, COALESCE(SUM(Refferals.amt), 0) AS refAmount
FROM Calendar
LEFT JOIN Profits_Referrals Refferals
ON Refferals.userId = ?
AND Refferals.date >= UNIX_TIMESTAMP(Calendar.calendar_date)
AND Refferals.date < UNIX_TIMESTAMP(Calendar.calendar_date + INTERVAL 1 DAY)
WHERE Calendar.calendar_date >= ?
AND Calendar.calendar_date < ?) Refferals
ON Referrals.calendar_date = Profits.calendar_date
ORDER BY Profits.Calendar_Date
(Remember that the individual subqueries are outputting a row for every date, and everything is already aggregated by that date - we can just join based on the date. This also means we don't need a separate {base_table}
here)