I have always been struggling with these types of queries. So, I'd like that someone checks my approach to handle those. I am asked to find how many transactions, on average, each user executes during a 12 hours timespan starting from the first transaction.
This is the data:
CREATE TABLE IF NOT EXISTS `transactions` (
`transactions_ts` timestamp ,
`user_id` int(6) unsigned NOT NULL,
`transaction_id` bigint NOT NULL,
`item` varchar(200), PRIMARY KEY(`transaction_id`)
) DEFAULT CHARSET=utf8;
INSERT INTO `transactions` (`transactions_ts`, `user_id`, `transaction_id`,`item` ) VALUES
('2016-06-18 13:46:51.0', 13811335,1322361417, 'glove'),
('2016-06-18 17:29:25.0', 13811335,3729362318, 'hat'),
('2016-06-18 23::07:12.0', 13811335,1322363995,'vase' ),
('2016-06-19 07:14:56.0',13811335,7482365143, 'cup'),
('2016-06-19 21:59:40.0',13811335,1322369619,'mirror' ),
('2016-06-17 12:39:46.0',3378024101,9322351612, 'dress'),
('2016-06-17 20:22:17.0',3378024101,9322353031,'vase' ),
('2016-06-20 11:29:02.0',3378024101,6928364072,'tie'),
('2016-06-20 18:59:48.0',13811335,1322375547, 'mirror');
My approach is the following (with the steps and the query itself below):
1) For each distinct user_id, find their first and 12 hours' transaction timestamp. This is accomplished by the inner query aliased as t1
2) Then, by inner join to the second inner query (t2), basically, I augment each row of the transactions table with two variables "first_trans" and "right_trans" of the 1st step. 3) Now, by where-condition, I select only those transaction timestamps that fall in the interval specified by first_trans and right_trans timestamps
4) Filtered table from the step 3 is now aggregated as count distinct transaction ids per user
5) The result of the 4 steps above is a table where each user has a count of transactions falling into the interval of 12 hrs from the first timestamp. I wrap it in another select that sums users' transaction counts and divides by the number of users, giving an average count per user.
I am quite certain that the end result is correct overall, but I keep thinking I might go without the 4th select. Or, perhaps, the whole code is somewhat clumsy, while my aim was to make this query as readable as possible, and not necessarily computationally optimal.
select
sum(dist_ts)/count(*) as avg_ts_per_user
from (
select
count(distinct transaction_id) as dist_ts,
us_id
from
(select
user_id as us_id,
min(transactions_ts) as first_trans,
min(transactions_ts) + interval 12 hour as right_trans
from transactions
group by us_id )
as t1
inner join
(select * from transactions )
as t2
on t1.us_id=t2.user_id
where transactions_ts >= first_trans
and transactions_ts < right_trans
group by us_id
) as t3