I have a huge table with over 1 million transaction records. For each transaction, I need to join the table to itself, pull all transactions of the same type from the 52 weeks prior, and aggregate them for later use in an ML model.
select distinct
    a.transref,
    a.transdate,
    a.transamount,
    a.transtype,
    avg(b.transamount) over (
        partition by a.transref, a.transdate, a.transamount, a.transtype
    ) as avg_trans_amount
from trans_table a
inner join trans_table b
    on  a.transtype  = b.transtype
    -- match transactions within the 52 weeks prior, up to the transaction date
    and b.transdate >= dateadd(week, -52, a.transdate)
    and b.transdate <= a.transdate
where a.transdate between '2021-11-16' and '2022-11-16'
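For what it's worth, I believe the DISTINCT plus window function is logically equivalent to this plain GROUP BY form (a sketch under the same assumptions about the date window):

select
    a.transref,
    a.transdate,
    a.transamount,
    a.transtype,
    avg(b.transamount) as avg_trans_amount
from trans_table a
inner join trans_table b
    on  a.transtype  = b.transtype
    and b.transdate >= dateadd(week, -52, a.transdate)
    and b.transdate <= a.transdate
where a.transdate between '2021-11-16' and '2022-11-16'
group by a.transref, a.transdate, a.transamount, a.transtype;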
The transaction table looks like this:
+--------+----------+-----------+---------+
|transref|transdate |transamount|transtype|
+--------+----------+-----------+---------+
|xh123rdk|2022-11-16|112.48 |food & Re|
|g8jegf90|2022-11-04|23.79 |Misc |
|ulpef32p|2022-10-23|83.15 |gasoline |
+--------+----------+-----------+---------+
The expected output should look like this:
+--------+----------+-----------+---------+----------------+
|transref|transdate |transamount|transtype|avg_trans_amount|
+--------+----------+-----------+---------+----------------+
|xh123rdk|2022-11-16|112.48 |food & Re|180.11 |
|g8jegf90|2022-11-04|23.79 |Misc |43.03 |
|ulpef32p|2022-10-23|83.15 |gasoline |112.62 |
+--------+----------+-----------+---------+----------------+
Since each transaction may match over 10,000 records of the same type, the query is very slow and expensive to run, and SQL Server failed to create the output table.
How can I optimize this query so it runs in a reasonable time?
Note: After the query failed, I worked around it with a stored procedure that splits table a into smaller chunks, joins each chunk to the full table, aggregates the results, and appends them to an output table, repeating until all of table a is covered (see the sketch below). This got the job done, but it was still slow. I expect there are better ways to do this in SQL without all the manual work.
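Here is a minimal sketch of that chunking loop; the output table name trans_agg, the one-week chunk size, and the date variables are placeholders of my own, not the exact procedure I ran:

declare @chunk_start date = '2021-11-16';
declare @range_end   date = '2022-11-16';
declare @chunk_end   date;

while @chunk_start <= @range_end
begin
    set @chunk_end = dateadd(week, 1, @chunk_start);

    -- aggregate one week's worth of table a against the full table;
    -- trans_agg is a placeholder name for the output table
    insert into trans_agg (transref, transdate, transamount, transtype, avg_trans_amount)
    select
        a.transref,
        a.transdate,
        a.transamount,
        a.transtype,
        avg(b.transamount)
    from trans_table a
    inner join trans_table b
        on  a.transtype  = b.transtype
        and b.transdate >= dateadd(week, -52, a.transdate)
        and b.transdate <= a.transdate
    where a.transdate >= @chunk_start
      and a.transdate <  @chunk_end
    group by a.transref, a.transdate, a.transamount, a.transtype;

    set @chunk_start = @chunk_end;
end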