2

I'm doing Recency-Frequency-Monetary analysis and while I have a model working in Python I'm trying to implement it in SQL due to production code being mainly PHP (Oracle 12c fwiw or could also be done in postgres). I saw an example online which added new columns using ntile however I want to use the equivalent of quantile (e.g. splitting up recency bin edges to 0,0,74,321 for 5 quantiles). I have started going the route below, but is there a tidier way?

SELECT
   percentile_disc(0.2) within group (order by recency) as recency_1_quin,
   percentile_disc(0.4) within group (order by recency) as recency_2_quin,
   percentile_disc(0.6) within group (order by recency) as recency_3_quin,
   percentile_disc(0.8) within group (order by recency) as recency_4_quin,
   percentile_disc(0.2) within group (order by monetary_value) as monetary_1_quin,
   percentile_disc(0.4) within group (order by monetary_value) as monetary_2_quin,
   percentile_disc(0.6) within group (order by monetary_value) as monetary_3_quin,
   percentile_disc(0.8) within group (order by monetary_value) as monetary_4_quin,
   percentile_disc(0.2) within group (order by frequency) as frequency_1_quin,
   percentile_disc(0.4) within group (order by frequency) as frequency_2_quin,
   percentile_disc(0.6) within group (order by frequency) as monetary_3_quin,
   percentile_disc(0.8) within group (order by frequency) as monetary_4_quin
FROM RFM;

Edit: With help, this is where I'm at, but stuck on a group by within a CTE. [42000][978] ORA-00978: nested group function without GROUP BY

RFM AS (
SELECT SRC_USER_ID,
  COUNT(distinct PICKUP_DATE) -1 as frequency,
  (MAX(PICKUP_DATE) - MIN(PICKUP_DATE)) as recency,
  (TO_DATE ('2018/05/12', 'yyyy/mm/dd') - MIN(PICKUP_DATE)) as T,
  SUM(PRICE_TOTAL) AS monetary_value
FROM TRANSACTIONS
group by SRC_USER_ID
ORDER BY frequency DESC),
  MAX_VALUES AS (
      select sum(max(recency) + 0.0000000001) as max_recency,
              sum(max(frequency) + 0.00000001) as max_frequency,
             sum(max(monetary_value) + 0.000000001) as max_monetary
      FROM RFM
  )
SELECT
    SRC_USER_ID,
    recency,
    frequency,
    monetary_value,
  WIDTH_BUCKET(recency, 0, max_recency, 5) "recency_quantile",
  WIDTH_BUCKET(frequency, 0, max_frequency, 5) "frequency_quantile",
  WIDTH_BUCKET(monetary_value, 0, max_monetary, 5) "monetary_quantile"
FROM RFM, MAX_VALUES
eamon1234
  • 1,555
  • 3
  • 19
  • 38
  • Does [this](https://stackoverflow.com/q/8529985) help? –  May 18 '18 at 07:24
  • Thanks for that and for the formatting! I don't immediately see how that solves it. There seems to be some (understandable) confusion between quantile and ntile... quantile is a statistical concept of breaking up a distribution, while ntile as implmented in sql and just divides it into an even number of rows. – eamon1234 May 18 '18 at 07:28
  • So are you maybe looking for `width_bucket()` then? –  May 18 '18 at 07:32
  • Yes that looks to do part of the trick, thanks! That would be the second part of the query (width_bucket(frequency_1_quin, frequency_2_quin) etc. Is there a nicer way to get teh freq_1,2? – eamon1234 May 18 '18 at 07:35
  • I guess what I need in the function is min(recency) and max(recency), however if I include that I have to use a groupby recency and it doesn't work. SELECT recency, WIDTH_BUCKET(recency, min(recency), max(recency), 5) "recency_quantile" FROM RFM GROUP BY RFM.SRC_USER_ID, recency; All values get assigned to 6 (the overflow bucket) – eamon1234 May 18 '18 at 07:51

1 Answers1

0

width_bucket() didn't give what I was looking for, so I ended up with the below.

RFM AS (
SELECT SRC_USER_ID,
  COUNT(distinct PICKUP_DATE) -1 as frequency,
  (MAX(PICKUP_DATE) - MIN(PICKUP_DATE)) as recency,
  (TO_DATE ('2018/05/12', 'yyyy/mm/dd') - MIN(PICKUP_DATE)) as T,
  SUM(PRICE_TOTAL) AS monetary_value
FROM TRANSACTIONS
group by SRC_USER_ID
ORDER BY frequency DESC),
QUINTILES AS (SELECT
   percentile_disc(0.2) within group (order by recency) as recency_1_quin,
   percentile_disc(0.4) within group (order by recency) as recency_2_quin,
   percentile_disc(0.6) within group (order by recency) as recency_3_quin,
   percentile_disc(0.8) within group (order by recency) as recency_4_quin,
   percentile_disc(0.2) within group (order by monetary_value) as monetary_value_1_quin,
   percentile_disc(0.4) within group (order by monetary_value) as monetary_value_2_quin,
   percentile_disc(0.6) within group (order by monetary_value) as monetary_value_3_quin,
   percentile_disc(0.8) within group (order by monetary_value) as monetary_value_4_quin,
   percentile_disc(0.2) within group (order by frequency) as frequency_1_quin,
   percentile_disc(0.4) within group (order by frequency) as frequency_2_quin,
   percentile_disc(0.6) within group (order by frequency) as frequency_3_quin,
   percentile_disc(0.8) within group (order by frequency) as frequency_4_quin
FROM RFM)
SELECT
    SRC_USER_ID,
    recency,
    frequency,
    monetary_value,
  (CASE WHEN recency <= recency_1_quin THEN 1
   WHEN recency > recency_1_quin and recency <= recency_2_quin THEN 2
   WHEN recency > recency_2_quin and recency <= recency_3_quin THEN 3
   WHEN recency > recency_3_quin and recency <= recency_4_quin THEN 4
   ELSE 5 END) "recency_quantile",
       (CASE WHEN frequency <= frequency_1_quin THEN 1
   WHEN frequency > frequency_1_quin and frequency <= frequency_2_quin THEN 2
   WHEN frequency > frequency_2_quin and frequency <= frequency_3_quin THEN 3
   WHEN frequency > frequency_3_quin and frequency <= frequency_4_quin THEN 4
   ELSE 5 END) "frequency_quantile",
     (CASE WHEN monetary_value <= monetary_value_1_quin THEN 1
   WHEN monetary_value > monetary_value_1_quin and monetary_value <= monetary_value_2_quin THEN 2
   WHEN monetary_value > monetary_value_2_quin and monetary_value <= monetary_value_3_quin THEN 3
   WHEN monetary_value > monetary_value_3_quin and monetary_value <= monetary_value_4_quin THEN 4
   ELSE 5 END) "monetary_value_quantile"

FROM RFM, QUINTILES
ORDER BY recency DESC, frequency DESC, monetary_value DESC;
eamon1234
  • 1,555
  • 3
  • 19
  • 38