To add to venuktan's answer, here is how to create a time-based sliding window using Spark SQL and retain the full contents of the window, rather than taking an aggregate of it. This was needed in my use case of preprocessing time series data into sliding windows for input into Spark ML.
One limitation of this approach is that it assumes you want to take sliding windows over time, rather than over a fixed number of rows.
First, create your Spark DataFrame, for example by reading in a CSV file:
df = spark.read.csv('foo.csv')
We assume that your CSV file has two columns: a Unix timestamp (in seconds) and the values you want to extract sliding windows from.
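Since spark.read.csv infers every column as a string by default, you may prefer to supply an explicit schema. A minimal sketch, assuming a headerless file whose columns are a timestamp in seconds and a numeric value:

from pyspark.sql.types import StructType, StructField, LongType, DoubleType

schema = StructType([
    StructField('_c0', LongType()),    # Unix timestamp in seconds
    StructField('_c1', DoubleType()),  # value to collect into windows
])
df = spark.read.csv('foo.csv', schema=schema)

Then build the sliding windows: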
from pyspark.sql import functions as f

window_duration = '1000 millisecond'
slide_duration = '500 millisecond'

# Convert the Unix timestamp to a timestamp type, then group the rows into
# overlapping 1-second windows that start every 500 ms, collecting the full
# contents of each window rather than aggregating it.
windowed = df \
    .withColumn('_c0', f.from_unixtime(f.col('_c0')).cast('timestamp')) \
    .groupBy(f.window('_c0', window_duration, slide_duration)) \
    .agg(f.collect_list(f.array('_c1')).alias('sliding_window'))
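You can inspect the result to check that each row pairs a window with the values that fell inside it:

# Each row holds a [start, end) window struct plus the collected values.
windowed.select('window.start', 'window.end', 'sliding_window') \
    .orderBy('window.start') \
    .show(truncate=False)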
Bonus: to convert this array column to the DenseVector format required for Spark ML, see the UDF approach here.
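In case that link goes stale, here is a minimal sketch of such a UDF, assuming the windowed column holds numeric values (cast '_c1' accordingly):

from pyspark.ml.linalg import DenseVector, VectorUDT

# Hypothetical helper: flatten the nested arrays in 'sliding_window'
# into a single DenseVector per row.
to_dense = f.udf(
    lambda window: DenseVector([float(x) for arr in window for x in arr]),
    VectorUDT(),
)
windowed = windowed.withColumn('features', to_dense('sliding_window'))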
Extra Bonus: to un-nest the resulting column, such that each element of your sliding window has its own column, try this approach here.
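Again, as a rough sketch: if every window holds the same number of elements (which requires a regular sampling rate relative to the slide duration), you can pull each position out by index. Here n = 2 is a hypothetical window size:

# Hypothetical un-nesting: one column per position in the window.
# Positions missing from a shorter window will come out as null.
n = 2
unnested = windowed.select(
    'window',
    *[f.col('sliding_window')[i][0].alias('element_%d' % i) for i in range(n)]
)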
I hope this helps, please let me know if I can clarify anything.