My goal is to select the average of exactly 5 records, and only if they meet the left join criteria against another table. Say we have table one (left) with these records:

RECNUM | ID  | DATE       | JOB
1      | cat | 2019.01.01 | meow
2      | dog | 2019.01.01 | bark

And we have table two (right) with records:

RECNUM | ID  | Action_ID | DATE       | REWARD
1      | cat | 1         | 2019.01.02 | 20
2      | cat | 99        | 2018.12.30 | 1
3      | cat | 23        | 2019.12.28 | 20
4      | cat | 54        | 2018.01.01 | 20
5      | cat | 32        | 2018.01.02 | 20
6      | cat | 21        | 2018.01.03 | 20
7      | cat | 43        | 2018.12.28 | 1
8      | cat | 65        | 2018.12.29 | 1
9      | cat | 87        | 2018.09.12 | 1
10     | cat | 98        | 2018.10.11 | 1
11     | dog | 56        | 2018.09.01 | 99
12     | dog | 42        | 2019.09.02 | 99

The expected result is:

ID  | AVG(Reward_from_latest_5_jobs)
cat | 1

The criteria are: for each JOB in the left table, find the 5 latest but older unique Action_ID(s) for the same ID in the right table and calculate the average REWARD over them. In other words: dog has barked, we do not know what reward to give him, so we try to compute the average of the latest five rewards he got. If fewer than 5 are found, return nothing (or null); if more are found, discard the oldest ones.
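To make the rule concrete, here is a minimal sketch of the intended logic in plain Python rather than SQL. The function name `latest_n_average` and the tuple layout are made up for illustration; the rows are the sample data from the two tables above.

```python
from datetime import date

# Sample rows from table two: (id, action_id, date, reward)
rewards = [
    ("cat", 1, date(2019, 1, 2), 20),
    ("cat", 99, date(2018, 12, 30), 1),
    ("cat", 23, date(2019, 12, 28), 20),
    ("cat", 54, date(2018, 1, 1), 20),
    ("cat", 32, date(2018, 1, 2), 20),
    ("cat", 21, date(2018, 1, 3), 20),
    ("cat", 43, date(2018, 12, 28), 1),
    ("cat", 65, date(2018, 12, 29), 1),
    ("cat", 87, date(2018, 9, 12), 1),
    ("cat", 98, date(2018, 10, 11), 1),
    ("dog", 56, date(2018, 9, 1), 99),
    ("dog", 42, date(2019, 9, 2), 99),
]

# Sample rows from table one: (id, job date)
jobs = [("cat", date(2019, 1, 1)), ("dog", date(2019, 1, 1))]


def latest_n_average(job_id, job_date, rows, n=5):
    """Average reward of the n latest rows strictly older than job_date.

    Returns None when fewer than n older rows exist (hypothetical helper).
    """
    older = [r for r in rows if r[0] == job_id and r[2] < job_date]
    older.sort(key=lambda r: r[2], reverse=True)  # newest first
    if len(older) < n:
        return None
    return sum(r[3] for r in older[:n]) / n


result = {job_id: latest_n_average(job_id, d, rewards) for job_id, d in jobs}
print(result)
```

With this data, cat has eight older rewards and the five latest are all 1, while dog has only one older reward and therefore gets no average.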

The way I wanted to do it is like:

    SELECT a."ID", COUNT(b."Action_ID"), AVG(b."REWARD")
    FROM
        (
            SELECT "ID", "DATE"
            FROM :left_table
        ) a
        LEFT JOIN
        (
            SELECT "ID", "Action_ID", "DATE", "REWARD"
            FROM :right_table
        ) b
        ON (a."ID" = b."ID")
    WHERE a."DATE" > b."DATE"
    GROUP BY a."ID"
    HAVING COUNT(b."Action_ID") >= 5;

But this calculates the average over all the Action_ID(s) that match the criteria, not only the five latest ones. Could you please tell me how to achieve the expected result? I can use sub-tables, and it does not have to be done in one SQL statement. Procedures are not allowed for this use case. Any input is highly appreciated.

wounky
  • I am struggling with the results when the left table has an "additional ID" that has to be preserved in the result together with the standard "ID". The problem is that each "ID" can have multiple "additional ID" values. As a result, the join makes a distinction per "additional ID" and multiplies the rows. I might need another question for it if I do not find a solution. Because of this, I think all the answers were very helpful and it is hard to pick just one. – wounky Nov 19 '19 at 17:09
  • Most likely solved by doing an AVG partitioned over "additional ID" in the main select. – wounky Nov 19 '19 at 18:17

3 Answers

You could use window functions, then aggregation:

select 
    id,
    avg(reward) avg_reward
from (
    select 
        t1.id, 
        t2.reward, 
        count(*) over(partition by t1.id) cnt,
        rank() over(partition by t1.id order by t2.date desc) rn
    from leftable t1
    inner join righttable t2 on t1.id = t2.id and t2.date < t1.date
) t
where cnt >= 5 and rn <= 5
group by id

The inner query joins the tables according to your requirement, does a window count of the total available records for each id, and ranks the records of each id by descending date.

The outer query then keeps only the ids that have at least 5 records and computes the average of the top 5 records for each id.
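As a rough check of this approach, the sketch below runs the same query shape against the question's sample data in an in-memory SQLite database through Python's `sqlite3` module. Several things here are assumptions rather than part of the answer: the table names `left_table` and `right_table`, the ISO date strings (so that text comparison orders dates correctly), the SQLite engine itself (3.25 or later for window functions), and the join condition `t2.date < t1.date`, which keeps only rewards strictly older than the job as the question requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE left_table (id TEXT, date TEXT, job TEXT);
CREATE TABLE right_table (id TEXT, action_id INTEGER, date TEXT, reward INTEGER);

INSERT INTO left_table VALUES
  ('cat', '2019-01-01', 'meow'),
  ('dog', '2019-01-01', 'bark');

INSERT INTO right_table VALUES
  ('cat',  1, '2019-01-02', 20),
  ('cat', 99, '2018-12-30',  1),
  ('cat', 23, '2019-12-28', 20),
  ('cat', 54, '2018-01-01', 20),
  ('cat', 32, '2018-01-02', 20),
  ('cat', 21, '2018-01-03', 20),
  ('cat', 43, '2018-12-28',  1),
  ('cat', 65, '2018-12-29',  1),
  ('cat', 87, '2018-09-12',  1),
  ('cat', 98, '2018-10-11',  1),
  ('dog', 56, '2018-09-01', 99),
  ('dog', 42, '2019-09-02', 99);
""")

query = """
select id, avg(reward) as avg_reward
from (
    select
        t1.id,
        t2.reward,
        count(*) over (partition by t1.id) as cnt,                      -- older rows per id
        rank() over (partition by t1.id order by t2.date desc) as rn   -- 1 = newest
    from left_table t1
    inner join right_table t2 on t1.id = t2.id and t2.date < t1.date
) t
where cnt >= 5 and rn <= 5
group by id
"""

rows = conn.execute(query).fetchall()
print(rows)
```

On this data only cat qualifies: it has eight older rewards, and the five newest of them are all 1. Dog has a single older reward, so the `cnt >= 5` filter drops it.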

GMB
  • Thank you so much! Will think about it, try out and provide feedback! – wounky Nov 18 '19 at 20:21
  • This one is the hardest for me to grasp, so I will try it last. Seems very nice though. – wounky Nov 18 '19 at 21:36
  • I have just tried it. It works neatly, but there is one drawback: if the right table has multiple entries on the same day, their rewards are summed before rank() is applied, so the average is calculated from the per-day sums instead of from the individual rewards. I also changed t2.date >= t1.date to t2.date < t1.date. – wounky Nov 19 '19 at 10:49
  • Trying to figure out how to prevent the SUM aggregation of REWARD on the same day, so that each reward is taken into the AVG. – wounky Nov 19 '19 at 10:56
  • Adding Action_ID (unique) to the subquery has helped. – wounky Nov 19 '19 at 12:36
  • I like this solution a lot and I think I will use it in the end, as it gives the same rank to all the records that share the fifth day in the result. – wounky Nov 19 '19 at 18:06

Use window functions to get the top 5:

select id, avg(reward)
from (select r.*,
             row_number() over (partition by l.id order by r.date desc) as seqnum
      from table1 l join
           table2 r
           on l.id = r.id and l.date > r.date
     ) r
where seqnum <= 5
group by id
having count(*) >= 5;

The having clause then filters out the ids that don't have five rows.
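One difference from a rank()-based query shows up when several rewards share the fifth-latest date: row_number() keeps exactly five rows, while rank() keeps every tied row. The sketch below illustrates this with made-up data in an in-memory SQLite database (an assumption, as is SQLite 3.25 or later for window functions). The `action_id desc` tie-breaker is also an addition for illustration, since without it the row that row_number() picks among ties is not deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id TEXT, date TEXT);
CREATE TABLE table2 (id TEXT, action_id INTEGER, date TEXT, reward INTEGER);

INSERT INTO table1 VALUES ('cat', '2019-01-01');

-- Hypothetical data: two rewards share the fifth-latest date (2018-12-26).
INSERT INTO table2 VALUES
  ('cat', 1, '2018-12-30', 10),
  ('cat', 2, '2018-12-29', 10),
  ('cat', 3, '2018-12-28', 10),
  ('cat', 4, '2018-12-27', 10),
  ('cat', 5, '2018-12-26', 10),
  ('cat', 6, '2018-12-26', 40);
""")

# Same query shape as above; only the ranking function and ordering vary.
template = """
select id, avg(reward)
from (select r.*,
             {ranking} over (partition by l.id order by {ordering}) as seqnum
      from table1 l join
           table2 r
           on l.id = r.id and l.date > r.date
     ) r
where seqnum <= 5
group by id
having count(*) >= 5
"""

# row_number(): exactly five rows; the tie-breaker picks action_id 6.
row_number_avg = conn.execute(
    template.format(ranking="row_number()",
                    ordering="r.date desc, r.action_id desc")
).fetchone()[1]

# rank(): both rows on 2018-12-26 get rank 5, so six rows are averaged.
rank_avg = conn.execute(
    template.format(ranking="rank()", ordering="r.date desc")
).fetchone()[1]

print(row_number_avg, rank_avg)
```

Here row_number() averages 10, 10, 10, 10 and 40, while rank() averages all six rewards. Which behavior is right depends on whether tied rewards on the fifth day should all count.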

Gordon Linoff

Here is how to do it with a join (if there are more joins you want to do, just repeat this method for every join):

  SELECT ONE.ID,
         CASE WHEN MAX(J1.RN) < 5 THEN NULL ELSE AVG(J1.REWARD) END AS REWARD_AVG
         -- we could also use count
       --CASE WHEN COUNT(*) = 5 THEN AVG(J1.REWARD) ELSE NULL END AS REWARD_AVG
  FROM TABLE_ONE ONE
  JOIN (
    SELECT
      T2.ID,
      T2.REWARD,
      ROW_NUMBER() OVER (PARTITION BY T2.ID ORDER BY T2.DATE DESC) AS RN
    FROM TABLE_TWO T2
    -- a derived table cannot reference ONE directly, so join TABLE_ONE here
    -- (assumes one row per ID in TABLE_ONE) to keep only the older rewards
    JOIN TABLE_ONE T1 ON T1.ID = T2.ID
    WHERE T2.DATE < T1.DATE
  ) AS J1 ON J1.ID = ONE.ID AND J1.RN <= 5 -- take first five only
  GROUP BY ONE.ID

Hogan
  • Thank you so much! Will think about it, try out and provide feedback! – wounky Nov 18 '19 at 20:21
  • Still thinking about it, but in this example I do not see the date comparison. The date from table_one should also be taken into consideration: it should be greater than the dates of the latest 5 **but older** records in the right_table. Would a where clause at the end be enough? – wounky Nov 18 '19 at 21:08
  • Trying to understand @wounky – Hogan Nov 18 '19 at 21:09
  • @wounky I put a where in the sub-query that will make all the dates in table two less than the date in table one... is this what you need? – Hogan Nov 18 '19 at 21:14
  • ...so if dog has barked in January 2019, I would like to consider only the rewards he got before January 2019. Even if there were 999 of them, calculate the average only over the latest 5 relative to the bark. – wounky Nov 18 '19 at 21:15
  • @wounky so that is what it does ... table 1 has jan 19 it will ignore items in table 2 after that. – Hogan Nov 18 '19 at 21:16
  • Yes, I think so, then it should work for all of the records. Meaning even if there was one in the past that did not have a reward assigned, we could calculate the average of the latest five relative to its time. Will try it out. – wounky Nov 18 '19 at 21:18