1

I would like to list the missing date between two dates in a request for example

my data :

TABLE ORDER

DATE_order  | AMOUNT    
01/01/2020  |  500    
01/01/2020  |  600    
03/01/2020  |  100    
05/01/2020  |  300

I want the request to return

01/01/2020   | 1100    
02/01/2020   | 0    
03/01/2020   | 100    
04/01/2020   | 0    
05/01/2020   | 300

i use Cassandra database whith Apach Hive connector

someone can help me ?

leftjoin
  • 36,950
  • 8
  • 57
  • 116
broly21
  • 93
  • 7
  • You need in recursive CTE for to generate the dates list from minimal till maximal without gaps. But HiveQL does not support this ([Apache Hive / ... / Common Table Expression](https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression) - *Recursive Queries are not supported*). – Akina Jan 21 '20 at 10:33
  • therefore it is not possible with hive? – broly21 Jan 21 '20 at 10:36
  • I wouldn's say. I only tell you that the common way for such task solving is unavailable - but maybe there exists some specific method or trick unknown to me. Some values generator, iterative user-defined function, something else... – Akina Jan 21 '20 at 10:44
  • thank you i will look towards this direction – broly21 Jan 21 '20 at 10:47
  • Please check my answer – leftjoin Jan 21 '20 at 11:57

1 Answers1

3

You can generate missing rows using lateral view and posexplode:

with your_data as (
select stack(4,
'2020-01-01',500,    
'2020-01-01',600,   
'2020-01-03',100,    
'2020-01-05',300
) as (DATE_order,AMOUNT )
)

select date_sub(s.date_order ,nvl(d.i,0)) as date_order, case when d.i > 0 then 0 else s.amount end as amount
from
(--find previous date
select date_order, amount, 
        lag(date_order) over(order by date_order) prev_date,
        datediff(date_order,lag(date_order) over(order by date_order)) datdiff
from
( --aggregate
 select date_order, sum(amount) amount from your_data group by date_order )s
)s
--generate rows
lateral view outer posexplode(split(space(s.datdiff-1),' ')) d as i,x
order by date_order;

Result:

date_order      amount
2020-01-01      1100
2020-01-02      0
2020-01-03      100
2020-01-04      0
2020-01-05      300
Time taken: 10.04 seconds, Fetched: 5 row(s)
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • it works but a group by is missing somewhere for me. I have duplicate dates, I try to correct I keep you posted quickly – broly21 Jan 21 '20 at 12:40
  • @broly21 Debug step by step. I hope you got general idea how to generate rows using explode+lateral view. If you have more keys, probably they should be added to teh partition by clause when calculating LAG: lag(date_order) over(PARTITION BY BLA, BLA order by date_order) – leftjoin Jan 21 '20 at 12:50
  • thank you i correct it was just i forgot to convert date(dateorder) in a group by, everything works fine thanks, but where do you manage the date interval? – broly21 Jan 21 '20 at 13:06
  • @broly21 date difference is calculated : `datediff(date_order,lag(date_order) over(order by date_order)) datdiff` then `space(datdiff)` - generates string of spaces with length=difference in days. Then split - generates array. then posexplode generates rows i is a position in array (days difference) it is subtracted from current row date to get missing date – leftjoin Jan 21 '20 at 13:09
  • OK thank you I understand better, I have the same case with the format YYYYMM example 201912 for December 2019, it will be difficult to reuse this request – broly21 Jan 21 '20 at 13:17
  • @broly21 Not difficult. You can add 01 as DD part and use add_months function instead of date_sub, calculate integer difference(of MM only) instead of datediff, etc. and execute your subqueries one by one for better understanding and debugging. Better create one more question if you have different requirements – leftjoin Jan 21 '20 at 13:30
  • @broly21 answered your question: https://stackoverflow.com/a/59849333/2700344 – leftjoin Jan 21 '20 at 21:29