I have the following code which creates windows and aggregates values in the windows.
from pyspark.sql import functions as func
from pyspark.sql.functions import window

df.groupBy(window("time", "30 minutes"))\
    .agg(func.countDistinct("customer_numbers").alias("customer_numbers"))
The window column (the column that holds the time periods) is now a struct containing two datetimes: [datetime1, datetime2].
My dataframe looks like this:
window                                        customer_numbers
[2018-02-04:10:00:00, 2018-02-04:10:30:00]    10
[2018-02-04:10:30:00, 2018-02-04:11:00:00]    15
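For reference, the schema of the aggregated frame should look roughly like this (calling the aggregated result df_agg just for illustration, and assuming the distinct count is aliased to customer_numbers as in the output above):

df_agg.printSchema()
# root
#  |-- window: struct (nullable = false)
#  |    |-- start: timestamp (nullable = true)
#  |    |-- end: timestamp (nullable = true)
#  |-- customer_numbers: long (nullable = false)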
I want it to look like this:
start                  end                    customer_numbers
2018-02-04:10:00:00    2018-02-04:10:30:00    10
2018-02-04:10:30:00    2018-02-04:11:00:00    15
I want to split this out into two columns and get rid of the original window column, but I can't seem to find a way to do this. I tried using UDFs. I think in Scala you can simply do something like the following to get the first item, but I do not know how to do the equivalent in PySpark. When I tried a UDF, instead of giving me the first value it gave me a calendar object.
.withColumn("key", $"window"._1)
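In PySpark, I'm guessing the equivalent is something along these lines, though this is only a sketch and I haven't been able to verify that struct fields can be pulled out by name like this (df_agg is the aggregated frame from above):

# reference the struct's start/end fields by name, copy them into
# top-level columns, then drop the original window struct column
df_agg.withColumn("start", func.col("window.start"))\
    .withColumn("end", func.col("window.end"))\
    .drop("window")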
Does anyone know how I can achieve this?