0

I am working on a large database with millions of rows and I am trying to be efficient in my queries. The database contains regular snapshots of a loan portfolio where sometimes loans default (status goes from '1' to <>'1'). When they do, they appear only once in the corresponding snapshot, then they are no longer reported. I am trying to get a cumulative count of such loans - as they develop over time and divided into many buckets depending on country of origin, vintage, etc. SUM (...) OVER seems to be a very efficient function to achieve the result but when I run the following query

Select 
assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status, 
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType, 
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal, 
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(aa27) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(aa26) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22>='2014-01' and aa22<='2014-12' and vintage='2015' and active=0 and aa74<>'1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate

I get

SQL Error (8120) column aa27 is invalid in the selected list because it is not contained in either an aggregate function or the GROUP BY clause

Can anyone shed some light? Thanks

GIG
  • 41
  • 1
  • 4

2 Answers2

0

I believe you want:

Select assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status, 
       AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType, 
       count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal, 
       SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
       SUM(SUM(aa27)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
       SUM(SUM(aa26)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22 >= '2014-01' and aa22 <= '2014-12' and vintage = '2015' and
      active = 0 and aa74 <> '1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate;

Note the SUM(SUM()) in the cumulative sum expressions.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I thought aggregates within aggregates are not allowed? i.e. `sum(sum())`? Whenever I try doing things like this I get `Cannot perform an aggregate function on an expression containing an aggregate or a subquery` – Simon Oct 30 '17 at 18:54
  • Thanks a lot for the insight, this was a big step in the right direction and gets rid of the error message (!) but the total keeps progressing, while I would have hoped that it would start from zero for each change in the GROUP BY bucket. E.g. if cumulative defaults are 20 for employees and 30 for self-employed, with this solution I get one single total count of 50 – GIG Oct 30 '17 at 22:07
  • @GIG . . . Each "group by bucket" produces one row, so I don't know what you mean by "progressing". Perhaps you should ask another question -- with a slightly simplified data set. – Gordon Linoff Oct 31 '17 at 01:32
  • I agree I need more studying starting from a simpler set. meanwhile, i believe using PARTITION BY and removing GROUP BY could go in the right direction. I also found a useful explanation of SUM OVER, PARTITION, RANGE, etc here: http://www.sqlservercentral.com/articles/Over+Clause/132079/ – GIG Oct 31 '17 at 10:39
0

This is what I found to be working, comparing my results with some external research data. I have simplified the fields for readability:

    select 
      poolcutoffdate, 
      count(1) as LoanCount,
      MAX(sum(case status when 'default' then 1 else 0 end)) 
      over (order by poolcutoffdate 
            ROWS between unbounded preceding AND CURRENT ROW) as CumulDefaults

from myDatabase
group by poolcutoffdate
order by poolcutoffdate asc

I am thus counting all loans that have been in the 'default' status at least once from inception to the current cutoff date.

Note the use of MAX(SUM()) so that the result is the largest of the various iteration from the first to the current row. Using SUM(SUM()) would add the various iterations leading to a cumulative of cumulatives.

I considered using SUM(SUM()) with "PARTITION BY poolcutoffdate" so that the tally restarts from 0 and does not add from the previous cutoff date but this would only include loans from the latest cutoff so if a loan had defaulted and removed from the pool it would wrongly not be counted.

Note the CASE in the OVER statement.

Thanks for all the help

GIG
  • 41
  • 1
  • 4