(ANSI SQL in general and) T-SQL in particular does provide for recursive queries, which can be used to fill the gaps:
CREATE TABLE Table1
(date_time datetime, quantity int)
;
INSERT INTO Table1
(date_time, quantity)
VALUES
('2023-01-01 14:05:00', 5),
('2023-01-01 14:12:00', 12),
('2023-01-01 14:13:00', 13),
('2023-01-01 14:20:00', 17)
;
;with
List0 (quantity) as (
    -- recursive CTE: a number table holding 1 .. 10000
    select 1
    union all
    select quantity + 1
    from List0
    where quantity < 10000
),
List1a as (
    -- gap filling via LEFT JOIN alone: unmatched rows keep NULL
    select top 1000000
        t1.date_time,
        l0.quantity
    from List0 l0
    left join Table1 t1
        on t1.quantity = l0.quantity
    order by l0.quantity
),
List1 as (
    -- gap filling with the 'next available' date_time per quantity
    select top 1000000
        (select date_time
         from Table1
         where quantity = (select min(quantity)
                           from Table1
                           where quantity >= l0.quantity)
        ) as date_time,
        l0.quantity
    from List0 l0
    left join Table1 t1 -- not referenced above; kept for symmetry with List1a
        on t1.quantity = l0.quantity
    order by l0.quantity
)
select
-- * from List0
* from List1a
-- * from List1 where quantity <= (select max(quantity) from Table1)
OPTION (MAXRECURSION 10000)
;
This comes with some flexibility and some limitations.
I am not sure whether it works this way in Apache Spark - or for your actual needs.
However, instead of the recursive CTE (List0), you could either use a system table with a sufficient number of consecutive IDs or generate a number table, as sketched below.
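A minimal sketch of the system-table variant, assuming sys.all_objects holds enough rows on the instance (the CROSS JOIN squares that count):

with List0 (quantity) as (
    -- non-recursive number table: 1 .. 10000 via ROW_NUMBER()
    select top (10000)
        row_number() over (order by (select null))
    from sys.all_objects a
    cross join sys.all_objects b
)
select * from List0
;

This variant needs no OPTION (MAXRECURSION ...) hint; conversely, OPTION (MAXRECURSION 0) would lift the recursion cap of the recursive version entirely.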
See it in action: SQL Fiddle.
How does it work?
NB: For illustration purposes, List1a has been added.
- First, some kind of number table is needed to get the full sequence of required values; here, that is the purpose of List0. (Technically, there is no need to name its column after the one in the actual data. And it has far more rows than required for the sample data - just to stress its generic nature.)
- List1a demonstrates how the LEFT JOIN fills the gaps in the original data. But all rows with no data for date_time come back with just NULL (see the sample output after this list).
- The sub-SELECT in the SELECT list of List1 pulls the 'next available' date_time from Table1 in order to fill those gaps (again, see below).
- top 1000000 is needed for SQL Server to 'allow' the ORDER BY inside a CTE. (Routinely, I pick the million - as it usually exceeds the number of expected rows by far... An OFFSET-based alternative is sketched after this list.)
- OPTION (MAXRECURSION 10000) raises the recursion limit for List0 from the default value of 100.
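For the sample data above, the first rows of List1a and List1 come out like this (abridged):

quantity | List1a.date_time    | List1.date_time
---------+---------------------+--------------------
       1 | NULL                | 2023-01-01 14:05:00
     ... | NULL                | 2023-01-01 14:05:00
       5 | 2023-01-01 14:05:00 | 2023-01-01 14:05:00
       6 | NULL                | 2023-01-01 14:12:00
     ... | NULL                | 2023-01-01 14:12:00
      12 | 2023-01-01 14:12:00 | 2023-01-01 14:12:00
      13 | 2023-01-01 14:13:00 | 2023-01-01 14:13:00
      14 | NULL                | 2023-01-01 14:20:00
     ... | NULL                | 2023-01-01 14:20:00
      17 | 2023-01-01 14:20:00 | 2023-01-01 14:20:00
      18 | NULL                | NULL

Beyond the maximum quantity in Table1 (here 17), List1 returns NULL as well - hence the commented-out filter on max(quantity) in the final SELECT.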
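As an aside, if the top 1000000 device feels arbitrary: since SQL Server 2012, ORDER BY ... OFFSET also makes the ORDER BY legal inside a CTE. A sketch, shown for List1a only:

List1a as (
    select
        t1.date_time,
        l0.quantity
    from List0 l0
    left join Table1 t1
        on t1.quantity = l0.quantity
    order by l0.quantity
    offset 0 rows
),

Note that neither form guarantees the order of the final result set; only an ORDER BY on the outermost SELECT does.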
Please comment if this requires adjustment or further detail.
Edit:
@thebluephantom kindly pointed out that Spark does not feature recursive queries.
There are a couple of discussions of obtaining sequences in T-SQL / SQL.
SparkSQL on pyspark: how to generate time series? focusses on Spark.
My objective was to give a working example of how the issue at hand can be resolved... Unfortunately, I can't address the Spark specifics.