I have a huge table with over 1 million transaction records. For each transaction, I need to join the table to itself, pull all transactions of the same type from the 52 weeks prior, and aggregate them for later use in an ML model.
select distinct
    a.transref,
    a.transdate,
    a.transamount,
    a.transtype,
    avg(b.transamount) over (
        partition by a.transref, a.transdate, a.transamount, a.transtype
    ) as avg_trans_amount
from trans_table a
inner join trans_table b
    on  a.transtype  = b.transtype
    -- match transactions within the 52 weeks prior, up to the transaction date
    and b.transdate >= dateadd(week, -52, a.transdate)
    and b.transdate <= a.transdate
where a.transdate between '2021-11-16' and '2022-11-16'
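For what it's worth, I believe the DISTINCT plus window function is logically equivalent to this plain GROUP BY form (a sketch under the same assumptions about the date window):

select
    a.transref,
    a.transdate,
    a.transamount,
    a.transtype,
    avg(b.transamount) as avg_trans_amount
from trans_table a
inner join trans_table b
    on  a.transtype  = b.transtype
    and b.transdate >= dateadd(week, -52, a.transdate)
    and b.transdate <= a.transdate
where a.transdate between '2021-11-16' and '2022-11-16'
group by a.transref, a.transdate, a.transamount, a.transtype;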
The transaction table looks like this:
+--------+----------+-----------+---------+
|transref|transdate |transamount|transtype|
+--------+----------+-----------+---------+
|xh123rdk|2022-11-16|112.48 |food & Re|
|g8jegf90|2022-11-04|23.79 |Misc |
|ulpef32p|2022-10-23|83.15 |gasoline |
+--------+----------+-----------+---------+
The expected output should look like this:
+--------+----------+-----------+---------+----------------+
|transref|transdate |transamount|transtype|avg_trans_amount|
+--------+----------+-----------+---------+----------------+
|xh123rdk|2022-11-16|112.48 |food & Re|180.11 |
|g8jegf90|2022-11-04|23.79 |Misc |43.03 |
|ulpef32p|2022-10-23|83.15 |gasoline |112.62 |
+--------+----------+-----------+---------+----------------+
Since each transaction may match over 10,000 records of the same type, the query is very slow and expensive to run, and SQL Server failed to create the output table.
How can I optimize this query so it runs in a reasonable time?
Note: After the query failed, I worked around it with a stored procedure that splits table a into smaller chunks, joins each chunk to the full table, aggregates the results, and appends them to an output table, repeating until all of table a is covered (see the sketch below). This got the job done, but it was still slow. I expect there are better ways to do this in SQL without all the manual work.
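Here is a minimal sketch of that chunking loop; the output table name trans_agg, the one-week chunk size, and the date variables are placeholders of my own, not the exact procedure I ran:

declare @chunk_start date = '2021-11-16';
declare @range_end   date = '2022-11-16';
declare @chunk_end   date;

while @chunk_start <= @range_end
begin
    set @chunk_end = dateadd(week, 1, @chunk_start);

    -- aggregate one week's worth of table a against the full table;
    -- trans_agg is a placeholder name for the output table
    insert into trans_agg (transref, transdate, transamount, transtype, avg_trans_amount)
    select
        a.transref,
        a.transdate,
        a.transamount,
        a.transtype,
        avg(b.transamount)
    from trans_table a
    inner join trans_table b
        on  a.transtype  = b.transtype
        and b.transdate >= dateadd(week, -52, a.transdate)
        and b.transdate <= a.transdate
    where a.transdate >= @chunk_start
      and a.transdate <  @chunk_end
    group by a.transref, a.transdate, a.transamount, a.transtype;

    set @chunk_start = @chunk_end;
end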