I would like to create a PySpark DataFrame containing a list of datetimes with a specific frequency.
Currently I'm using the approach below, which seems quite cumbersome, and I'm pretty sure there are better ways:
import datetime as dt
import numpy as np
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import IntegerType

# Define the date range
START_DATE = dt.datetime(2019, 8, 15, 20, 30, 0)
END_DATE = dt.datetime(2019, 8, 16, 15, 43, 0)
# Generate the date range with pandas
timerange = pd.date_range(start=START_DATE, end=END_DATE, freq='15min')
# Convert to unix timestamps (seconds since the epoch)
timestamps = [int(x) for x in timerange.values.astype(np.int64) // 10 ** 9]
# Create a pyspark dataframe from the timestamps above
(spark.createDataFrame(timestamps, IntegerType())
 .withColumn('value_date', sf.from_unixtime('value'))
 .drop('value')
 .withColumnRenamed('value_date', 'date')
 .show())
which outputs
+-------------------+
| date|
+-------------------+
|2019-08-15 20:30:00|
|2019-08-15 20:45:00|
|2019-08-15 21:00:00|
|2019-08-15 21:15:00|
|2019-08-15 21:30:00|
|2019-08-15 21:45:00|
|2019-08-15 22:00:00|
|2019-08-15 22:15:00|
|2019-08-15 22:30:00|
|2019-08-15 22:45:00|
|2019-08-15 23:00:00|
|2019-08-15 23:15:00|
|2019-08-15 23:30:00|
|2019-08-15 23:45:00|
|2019-08-16 00:00:00|
|2019-08-16 00:15:00|
|2019-08-16 00:30:00|
|2019-08-16 00:45:00|
|2019-08-16 01:00:00|
|2019-08-16 01:15:00|
+-------------------+
Can you suggest a smarter way to achieve this?
Thanks
Edit:
This seems to work:
(spark.sql(
     'SELECT sequence({start_date}, {end_date}, 60*15) AS timestamp_seq'
     .format(start_date=int(START_DATE.timestamp()),
             end_date=int(END_DATE.timestamp())))
 .withColumn('timestamp', sf.explode('timestamp_seq'))
 .select(sf.col('timestamp').cast('timestamp').alias('datetime'))
 .show())
but I'm unable to make it work without first converting the datetimes to unix timestamps.
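For reference, this is the kind of thing I was trying. If sequence in Spark 2.4+ indeed accepts timestamp bounds with an interval step (which I believe it does), a sketch like the following should skip the unix conversion entirely; I haven't been able to verify it on my setup, so treat it as an assumption:
# Hypothetical: relies on sequence() accepting timestamp bounds
# plus an interval step (Spark 2.4+)
spark.sql(
    "SELECT explode(sequence(to_timestamp('{start}'), to_timestamp('{end}'), "
    "interval 15 minutes)) AS datetime".format(start=START_DATE, end=END_DATE)
).show()
Here to_timestamp parses the default str() rendering of the Python datetimes, and explode turns the generated array of timestamps into one row each.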