0

I have always been struggling with these types of queries. So, I'd like that someone checks my approach to handle those. I am asked to find how many transactions, on average, each user executes during a 12 hours timespan starting from the first transaction.

This is the data:

CREATE TABLE IF NOT EXISTS `transactions` (
`transactions_ts` timestamp ,
`user_id` int(6) unsigned NOT NULL,
`transaction_id` bigint NOT NULL, 
`item` varchar(200), PRIMARY KEY(`transaction_id`)
) DEFAULT CHARSET=utf8;

INSERT INTO `transactions` (`transactions_ts`, `user_id`, `transaction_id`,`item` ) VALUES
  ('2016-06-18 13:46:51.0', 13811335,1322361417, 'glove'),
  ('2016-06-18 17:29:25.0', 13811335,3729362318, 'hat'), 
  ('2016-06-18 23::07:12.0', 13811335,1322363995,'vase' ),
  ('2016-06-19 07:14:56.0',13811335,7482365143, 'cup'),
  ('2016-06-19 21:59:40.0',13811335,1322369619,'mirror' ),
  ('2016-06-17 12:39:46.0',3378024101,9322351612, 'dress'),
  ('2016-06-17 20:22:17.0',3378024101,9322353031,'vase' ),
  ('2016-06-20 11:29:02.0',3378024101,6928364072,'tie'),
  ('2016-06-20 18:59:48.0',13811335,1322375547, 'mirror');

My approach is the following (with the steps and the query itself below):

1) For each distinct user_id, find their first and 12 hours' transaction timestamp. This is accomplished by the inner query aliased as t1

2) Then, by inner join to the second inner query (t2), basically, I augment each row of the transactions table with two variables "first_trans" and "right_trans" of the 1st step. 3) Now, by where-condition, I select only those transaction timestamps that fall in the interval specified by first_trans and right_trans timestamps

4) Filtered table from the step 3 is now aggregated as count distinct transaction ids per user

5) The result of the 4 steps above is a table where each user has a count of transactions falling into the interval of 12 hrs from the first timestamp. I wrap it in another select that sums users' transaction counts and divides by the number of users, giving an average count per user.

I am quite certain that the end result is correct overall, but I keep thinking I might go without the 4th select. Or, perhaps, the whole code is somewhat clumsy, while my aim was to make this query as readable as possible, and not necessarily computationally optimal.

select 

sum(dist_ts)/count(*) as avg_ts_per_user

from (

select 
count(distinct transaction_id) as dist_ts, 
us_id 
from 

(select 
user_id as us_id, 
min(transactions_ts) as first_trans, 
min(transactions_ts) + interval 12 hour as right_trans 

from transactions 
group by us_id ) 

as t1 

inner join 

(select * from transactions ) 
as t2 

on t1.us_id=t2.user_id

where transactions_ts >= first_trans
and transactions_ts <  right_trans
group by us_id

) as t3 

Fiddle demo

  • I added a https://dbfiddle.uk/?rdbms=mysql_5.6&fiddle=e0016f9a4fb6ff94a64ba9ccba4173bc – Alfred Beit Sep 07 '19 at 13:58
  • The result is computed in the fiddle itself just run it. I need more advice on how to improve this code from the syntax point. And my claim is that the computation is correct, however, if someone spots a mistake, I'd appreciate – Alfred Beit Sep 07 '19 at 14:01
  • If you mean that i need to provide an output of every step of the code (like debugging) that is another story. I'll try to do this. – Alfred Beit Sep 07 '19 at 14:02
  • ok then i misinterpret the question guess i need some more coffee.. *"while my aim was to make this query as readable as possible, and not necessarily computationally optimal."* Code reviews are generally [offtopic](https://stackoverflow.com/help/on-topic) for stackoverflow or in the gray area – Raymond Nijland Sep 07 '19 at 14:03
  • .. but if that is really the create table structure which is also "live" this table could use some indexing.. – Raymond Nijland Sep 07 '19 at 14:04
  • Ok i see. But then again i am not 100% sure the query is correct. I tend to oversee sometimes a detail or two. – Alfred Beit Sep 07 '19 at 14:05
  • *"But then again i am not 100% sure the query is correct. I tend to oversee sometimes a detail or two."* If that's the case see [Why should I provide a Minimal Reproducible Example for a very simple SQL query?](https://meta.stackoverflow.com/questions/333952/why-should-i-provide-a-minimal-reproducible-example-for-a-very-simple-sql-query) for providing us expected results.. How else can we verify the correctness if we don't know what the result should be? – Raymond Nijland Sep 07 '19 at 14:06
  • yeah, i see, thanks for correcting. I added a primary key to the create table. As per the offtopic, this question does seem like a code review, sorry – Alfred Beit Sep 07 '19 at 14:14
  • I'm voting to close this question as off-topic because this question seams to be better ontopic/suited for https://codereview.stackexchange.com/ – Raymond Nijland Sep 07 '19 at 14:18

1 Answers1

1

I don't think there is a mistake per se. The code can be slightly simplified (and neatened up a bit as follows):

select sum(dist_ts)/count(*) as avg_ts_per_user
from (
    select count(distinct transaction_id) as dist_ts, us_id
    from (
        select user_id as us_id, min(transactions_ts) as first_trans, min(transactions_ts) + interval 12 hour as right_trans
        from transactions
        group by us_id
        ) as t1
    inner join transactions as t2
    on t1.us_id=t2.user_id and transactions_ts >= first_trans and transactions_ts <  right_trans
    group by us_id
) as t3

The (select * from transactions ) as t2 was simplified above and I somewhat arbitrarilly moved a where clause condition to the on clause of the inner join.

My Fiddle Demo

Here is a second way that does not use inner joins:

select sum(cnt)/count(*) as avg_ts_per_user from (
    select count(*) as cnt, t.user_id
    from transactions t
    where t.transactions_ts >= (select min(transactions_ts) from transactions where user_id = t.user_id)
         and t.transactions_ts < (select min(transactions_ts) + interval 12 hour from transactions where user_id = t.user_id)
    group by t.user_id
) sq

Another Fiddle

You should probably run EXPLAIN against the two queries to see which one runs better on your server. Also note that min(transaction_ts) is specified twice for each user. Is MySql able to avoid the redundant calculation? I don't know. One possibility would be to create a temporary table consisting of user_id and min_transaction_ts so that the value is computed once. This would only make sense if your table had lots of rows and maybe not even then.

Booboo
  • 38,656
  • 3
  • 37
  • 60
  • yeah, thanks. I am usually hesitant to combine on and where. – Alfred Beit Sep 07 '19 at 14:11
  • It can make a difference (in OUTER JOINs) but not in this particular case. See [SQL join: where clause vs. on clause](https://stackoverflow.com/questions/354070/sql-join-where-clause-vs-on-clause). – Booboo Sep 07 '19 at 14:18
  • I have revised my answer with a second method that does not use an inner join. You should do run EXPLAIN against the two queries to see which might run better for you installation. – Booboo Sep 07 '19 at 14:44