4

I have a datalake on AWS, queried using Athena, with the following structure and sample data

Key |     Date      |  Value
----+---------------+-------
 a  |   01/01/2020  |  4.5
 a  |   05/01/2020  |  6
 a  |   06/01/2020  |  3.2
 b  |   01/01/2020  |  2.4
 b  |   03/01/2020  |  5

I would like to run a query to extract values for a specific date and for each key. If date is not a known one, like 99% of the time, value should be returned as a linear interpolation of the two closest one.

Dates are here reported in dd/mm/YYYY format for simplicity, but in the datalake are stored as timestamps.

Example of result

If I want to get the values for the 2nd of January (02/01/2020), the expected output is

Key |     Date      |  Value
----+---------------+-------
 a  |   02/01/2020  |  4.875
 b  |   02/01/2020  |  3.70

Where 4.875 is the linear interpolation between 4.5 (value at 01/01/2020) and 6 (value at 05/01/2020). I have manually evaluated it as (y - 4.5) / (2 - 1) = (6 - 4.5) / (5 - 1) (see linear interpolation for more reference).

Same for 3.7

How can I achieve that (if possible) with one single query?

Assumption: we always have a smaller and bigger date from the point we're searching for.

Update - Athena, based on PrestoDB, does not support JOIN LATERAL, so that is not an option I can consider

GMB
  • 216,147
  • 25
  • 84
  • 135
70ny
  • 748
  • 1
  • 7
  • 22

1 Answers1

3

This might be a good place for lateral joins:

select d.dt, 
    case 
        when n.date = p.date then p.value
        else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
    end as new_value
from (select date '2020-04-01') d(date)
cross join lateral (
    select t.* from mytable t where t.date <= d.date order by t.date desc limit 1
) p  -- "previous" value
cross join lateral (
    select t.* from mytable t where t.date >= d.date order by t.date limit 1
) n  -- "next" value

We can write the query without lateral joins:

select date '2020-04-01' as dt, p.k,
    case 
        when n.date = p.date then p.value
        else p.value + (n.value - p.value) / datediff('day', n.date, p.date)
    end as new_value
from (
    select t.*, 
        row_number() over(partition by k order by date desc) as rn
    from mytable t
    where date <= '2020-04-01'
) p
inner join (
    select t.*, 
        row_number() over(partition by k order by date) as rn
    from mytable t
    where date >= '2020-04-01'
) n on n.k = p.k
where p.rn = 1 and n.rn = 1

This also generalizes the query so it can process multiple keys at once (key is language keyword, I used k instead).

GMB
  • 216,147
  • 25
  • 84
  • 135
  • 1
    thanks for your reply! It took me a while as running this query in Athena was returning me a 500 error. I then contacted their support and they informed me that Athena does not implement `JOIN LATERAL`, since it's based on PrestoDB (while it's available in PrestoSQL instead). So this solution does not work on Athena. – 70ny Dec 22 '20 at 13:37
  • 3
    @70ny: the lateral join was just a convenience. See my edit. – GMB Dec 22 '20 at 14:17
  • 1
    thanks for the update! I understand your second query but that will return just one row, while I need one row per key. `limit 1` is causing to get just one element, and not one element per key (I should have noticed that also in your first proposed query) – 70ny Dec 22 '20 at 15:08
  • 3
    @70ny: OK. I changed the second query. – GMB Dec 23 '20 at 01:55