I'm using SparkSQL on pyspark to load some PostgreSQL tables into DataFrames and then build a query that generates several time series from start and stop columns of type date.
Suppose my_table contains:
start      | stop
-----------+-----------
2000-01-01 | 2000-01-05
2012-03-20 | 2012-03-23
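For reference, here's a minimal pyspark snippet that reproduces my_table as a DataFrame (the sample dates are the ones above; the session setup is just boilerplate):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild my_table with start/stop as real date columns
my_table = spark.createDataFrame(
    [("2000-01-01", "2000-01-05"), ("2012-03-20", "2012-03-23")],
    ["start", "stop"],
).select(F.to_date("start").alias("start"), F.to_date("stop").alias("stop"))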
In PostgreSQL it's very easy to do that:
SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table
and it will generate this table:
dt
------------
2000-01-01
2000-01-02
2000-01-03
2000-01-04
2000-01-05
2012-03-20
2012-03-21
2012-03-22
2012-03-23
But how can I do this using plain SparkSQL? Is it necessary to use UDFs or some DataFrame methods?
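For what it's worth, the closest thing I've found so far is combining sequence with explode, which I believe are available in Spark 2.4+; here's a sketch building on the DataFrame above, which I haven't verified beyond toy data:

# Assumes Spark >= 2.4: sequence(start, stop, step) builds an array of dates
# and explode() turns that array into one row per date, like generate_series
my_table.createOrReplaceTempView("my_table")
dt = spark.sql("""
    SELECT explode(sequence(start, stop, interval 1 day)) AS dt
    FROM my_table
""")
dt.show()

The DataFrame equivalent would presumably be
F.explode(F.sequence("start", "stop", F.expr("interval 1 day"))).alias("dt").
Is this the idiomatic approach, or is a UDF still required on older Spark versions?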