What's the best way to compute AUC in SQL?

Here is what I've got (assuming a table T(label, confid) where label is 0 or 1):

SELECT sum(cumneg * label) * 1e0 / (sum(label) * sum(1-label)) AS auc
FROM (
  SELECT label,
    -- running count of negatives seen so far, in confidence order
    sum(1-label) OVER (ORDER BY confid ROWS UNBOUNDED PRECEDING) (BIGINT) cumneg
  FROM T
) t;

I have to multiply by 1e0 in Teradata to get a real (non-integer) result, and the BIGINT cast is necessary to avoid overflow.
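
To sanity-check the query, here is a hypothetical four-row table (generic SQL, not Teradata-specific) where the AUC can be worked out by hand:

-- Hypothetical sanity-check data: 2 positives, 2 negatives.
CREATE TABLE T (label INTEGER, confid FLOAT);
INSERT INTO T VALUES (0, 0.1);
INSERT INTO T VALUES (1, 0.35);
INSERT INTO T VALUES (0, 0.4);
INSERT INTO T VALUES (1, 0.8);
-- 3 of the 4 positive/negative pairs are ordered correctly
-- (only 0.35 vs 0.4 is inverted), so the query should return 0.75.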

iggy

3 Answers


Here is a slightly different and maybe simpler solution I found:

SELECT (sum(label*r) - 0.5*sum(label)*(sum(label)+1)) / (sum(label) * sum(1-label)) AS auc
FROM (
  -- rank every example by confidence, lowest confidence first
  SELECT label, row_number() OVER (ORDER BY confid) r
  FROM T
) t;

It returns the same result as the query in the question: sum(label*r) is the rank sum of the positives, and subtracting the smallest possible rank sum, 0.5*P*(P+1) for P positives, leaves exactly the number of positive/negative pairs that are ordered correctly (the Mann-Whitney U statistic).
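
On the hypothetical four-row table from the question, the ranks of the two positives are 2 and 4, which gives the same 0.75:

-- Intermediate ranks on the toy table, ordered by confid:
SELECT label, confid, row_number() OVER (ORDER BY confid) AS r
FROM T;
-- (0, 0.1, 1), (1, 0.35, 2), (0, 0.4, 3), (1, 0.8, 4)
-- sum(label*r) = 2 + 4 = 6 and 0.5*2*(2+1) = 3,
-- so auc = (6 - 3) / (2 * 2) = 0.75, matching the question's query.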

Update

This SQL query (as well as the one in the question) is non-deterministic when there are multiple examples with the same prediction (confid) but different labels. To compute a deterministic AUC using interpolation, the query can be modified as follows:

SELECT (sum(pos*r) - 0.5*sum(pos)*(sum(pos)+1) - 0.5*sum(pos*neg)) /
  (sum(pos) * sum(neg)) AS auc
FROM (
  SELECT pos, neg,
    -- r is the highest rank within each group of tied confidences
    sum(pos+neg) OVER (ORDER BY confid ROWS UNBOUNDED PRECEDING) r
  FROM (
    -- collapse ties: count positives and negatives per confidence value
    SELECT confid, sum(label) AS pos, sum(1-label) AS neg
    FROM T
    GROUP BY confid) t
) t;

In the AUC formula, the denominator is the total number of pairs (positive X negative). The numerator computes how many of those pairs are ranked correctly. sum(pos*r) computes the total number of pairs so far (based on confidence order). That number includes positive X positive pairs, so the second term subtracts those. Finally, the last term subtracts half of the positive X negative pairs with the same prediction.
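
As a worked example of the tie correction (hypothetical data): take the toy table from the question, but move the two middle examples onto the same prediction so that a positive and a negative share confid = 0.5.

-- Hypothetical tied data: (0, 0.2), (1, 0.5), (0, 0.5), (1, 0.8).
-- After grouping by confid:
--   confid  pos  neg  r (running sum of pos+neg)
--   0.2     0    1    1
--   0.5     1    1    3
--   0.8     1    0    4
-- sum(pos*r)                = 1*3 + 1*4 = 7
-- 0.5*sum(pos)*(sum(pos)+1) = 0.5*2*3   = 3
-- 0.5*sum(pos*neg)          = 0.5*1*1   = 0.5  (the tied pair counts half)
-- auc = (7 - 3 - 0.5) / (2 * 2) = 0.875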

iggy
  • Can you pls share the underlying math explanation? – Saurav Jun 07 '21 at 05:17
  • See if this helps: https://stephanosterburg.gitbook.io/scrapbook/data-science/ds-cheatsheets/machine-learning/fast-computation-of-auc-roc-score – iggy Jun 08 '21 at 16:27

The pseudo-SQL below takes advantage of the fact that ROC AUC equals the probability that the predicted score ranks a random positive above a random negative. The query assumes that both classes have at least 10000 rows. The calculated AUC is not exact but a randomized estimate. See also the same question for R.

-- Sample up to 10000 scores from each class. This uses the question's
-- schema T(label, confid); the rand()/limit syntax varies by dialect.
WITH POSITIVE_SCORES AS (
  select
    confid as p_pos
  from
    T
  where label = 1
  order by rand()
  limit 10000
),

NEGATIVE_SCORES AS (
  select
    confid as p_neg
  from
    T
  where label = 0
  order by rand()
  limit 10000
)

-- estimate P(p_pos > p_neg) over all sampled pairs; ties count as half
select
  avg(case
    when p_pos > p_neg then 1
    when p_pos = p_neg then 0.5
    else 0
  end) as auc
from
  POSITIVE_SCORES
  cross join
  NEGATIVE_SCORES;
Jussi Kujala

For calculating an exact, deterministic AUC score, we should aggregate by "confid" to handle cases where not all confidence values are unique. Then we calculate the trapezoid area for each unique confidence value and sum them all, with an additional check for the case when all labels are zeros or all are ones. Note that the multiplication can overflow the integer type; you can prevent that with a BIGINT cast (see the sketch after the query).

MS SQL Implementation:

select
    -- guard against division by zero when only one class is present
    IIF(SUM(Ones) * SUM(Zeros) <> 0,
        -- trapezoid per group: 0.5 * Zeros * Ones for the tied part,
        -- plus Height * Ones for negatives strictly below the group
        SUM(IIF(Zeros * Ones > 0, 0.5 * Zeros * Ones + Height * Ones, Height * Ones))
            / (SUM(Ones) * SUM(Zeros)),
        0) as auc
from (
    select
        Zeros,
        Ones,
        -- running count of negatives from lower-confidence groups
        -- (only used where Ones > 0, where it is exact)
        SUM(IIF(Zeros * Ones > 0, 0, Zeros) + IIF(PrevZeros * PrevOnes > 0, PrevZeros, 0))
            OVER (ORDER BY PD) as Height
    from (
        select
            confid as PD,
            SUM(label) as Ones,
            SUM(ABS(1 - label)) as Zeros,
            LAG(SUM(label), 1, NULL) OVER (ORDER BY confid) as PrevOnes,
            LAG(SUM(ABS(1 - label)), 1, NULL) OVER (ORDER BY confid) as PrevZeros
        from T
        group by confid
    ) q1
) q2;
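
A minimal sketch of the overflow note above (the @Ones and @Zeros variables are hypothetical stand-ins for SUM(Ones) and SUM(Zeros)): with plain INT counts, 100000 positives times 100000 negatives is 10^10 pairs, which overflows a 32-bit INT, so cast one factor before multiplying.

-- Hypothetical counts standing in for SUM(Ones) and SUM(Zeros).
DECLARE @Ones INT = 100000, @Zeros INT = 100000;

-- @Ones * @Zeros alone raises an arithmetic overflow error;
-- casting one factor first makes the product evaluate as BIGINT.
SELECT CAST(@Ones AS BIGINT) * @Zeros AS PairCount;  -- 10000000000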
okrn