1

I have the following dataset with let's say ID = {1,[...],5} and Col1 = {a,b,c,Null} :

ID Col1 Date
1 a 01/10/2022
1 a 02/10/2022
1 a 03/10/2022
2 b 01/10/2022
2 c 02/10/2022
2 c 03/10/2022
3 a 01/10/2022
3 b 02/10/2022
3 Null 03/10/2022
4 c 01/10/2022
5 b 01/10/2022
5 Null 02/10/2022
5 Null 03/10/2022

I would like to group my rows by ID, compute new columns to show the number of occurences and compute a new column that would show a string of characters, depending on the frequency of Col1. With most a = Hi, most b = Hello, most c = Welcome, most Null = Unknown. If multiple modalities except Null have the same frequency, the most recent one based on date wins.

Here is the dataset I need :

ID nb_a nb_b nb_c nb_Null greatest
1 3 0 0 0 Hi
2 0 1 2 0 Welcome
3 1 1 0 1 Hello
4 0 0 1 0 Welcome
5 0 1 0 2 Unknown

I have to do this in a compute recipe in Dataiku. The group by is handled by the group by section of the recipe while the rest of the query needs to be done in the "custom aggregations" section of the recipe. I'm having troubles with the if equality then most recent part of the code.

My SQL code looks like this :

CASE WHEN SUM(CASE WHEN Col1 = a THEN 1 ELSE 0) >
          SUM(CASE WHEN Col1 = b THEN 1 ELSE 0)
      AND SUM(CASE WHEN Col1 = a THEN 1 ELSE 0) >
          SUM(CASE WHEN Col1 = c THEN 1 ELSE 0)
     THEN 'Hi'

CASE WHEN SUM(CASE WHEN Col1 = b THEN 1 ELSE 0) >
          SUM(CASE WHEN Col1 = a THEN 1 ELSE 0)
      AND SUM(CASE WHEN Col1 = b THEN 1 ELSE 0) >
          SUM(CASE WHEN Col1 = c THEN 1 ELSE 0)
     THEN 'Hello'

CASE WHEN SUM(CASE WHEN Col1 = c THEN 1 ELSE 0) >
          SUM(CASE WHEN Col1 = a THEN 1 ELSE 0)
      AND SUM(CASE WHEN Col1 = c THEN 1 ELSE 0) >
          SUM(CASE WHEN Col1 = b THEN 1 ELSE 0)
     THEN 'Welcome'

Etc, etc, repeat for other cases.

But surely there must be a better way to do this right? And I have no idea how to include the most recent one when frequencies are the same.

Thank you for your help and sorry if my message isn't clear.

1 Answers1

0

I tried to repro this in Azure Synapse using SQL script. Below is the approach.

  • Sample Table is created as in below image.
Create  table tab1 (id int, col1 varchar(50), date_column date)
Insert  into tab1 values(1,'a','2021-10-01')
Insert  into tab1 values(1,'a','2021-10-02')
Insert  into tab1 values(1,'a','2021-10-03')
Insert  into tab1 values(2,'b','2021-10-01')
Insert  into tab1 values(2,'c','2021-10-02')
Insert  into tab1 values(2,'c','2021-10-03')
Insert  into tab1 values(3,'a','2021-10-01')
Insert  into tab1 values(3,'b','2021-10-02')
Insert  into tab1 values(3,'Null','2021-10-03')
Insert  into tab1 values(4,'c','2021-10-01')
Insert  into tab1 values(5,'b','2021-10-01')
Insert  into tab1 values(5,'Null','2021-10-02')
Insert  into tab1 values(5,'Null','2021-10-03')

enter image description here

Step:1

Query is written to find the count of values within the group id,col1 and maximum date value within each combination of id, col1.

select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1

enter image description here

Step:2

Row number is calculated within each id, col1 group on the decreasing order of count and max_date columns. This is done when two or more values have same frequency, then to assign value based on latest date.

select *, row_number() over (partition by id order by count desc, max_date desc) as row_num from
(select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1)q1

enter image description here

Step:3

Line items with row_num=1 are filtered and values for the greatest column is assigned with the logic most a = Hi, most b = Hello, most c = Welcome, most Null = Unknown.

Full Query

select id, 
[greatest]=case when col1='a' then 'Hi'
                when col1='b' then 'Hello'
                when col1='c' then 'Welcome'
                else 'Unknown'
                end 
from
(select *, row_number() over (partition by id order by count desc, max_date desc) as row_num from
(select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1)q1
)q2 where row_num=1

Output enter image description here

By this approach, even when the frequencies are same, based on the most recent date, required values can be updated.

Aswin
  • 4,090
  • 2
  • 4
  • 16