
I have the following code which creates windows and aggregates values in the windows.

import pyspark.sql.functions as func
from pyspark.sql.functions import window

df.groupBy(window("time", "30 minutes")) \
  .agg(func.countDistinct("customer_numbers"))

The window column (the column that holds the time periods) is now a struct of two datetimes:

[datetime1, datetime2]. 

My dataframe looks like this:

window                                       customer_numbers
[2018-02-04:10:00:00, 2018-02-04:10:30:00]          10
[2018-02-04:10:30:00, 2018-02-04:11:00:00]          15

I want it to look like this:

start                      end                        customer_numbers
2018-02-04:10:00:00        2018-02-04:10:30:00        10
2018-02-04:10:30:00        2018-02-04:11:00:00        15

I want to split this out into two columns and get rid of the original window column, but I can't seem to find a way to do it. I think in Scala you can simply do something like the following to get the first item, but I do not know how to do this in PySpark. I tried adding a UDF, but instead of giving me the first value it gave me a calendar.

.withColumn("key", $"window"._1)

Does anyone know how I can achieve this?

SecretAgent
    Possible duplicate of [Querying Spark SQL DataFrame with complex types](https://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types) – philantrovert Jun 07 '18 at 07:48

1 Answer


You can simply use the select function, reading the struct's start and end fields by name:

.select(func.col('window').start.alias('start'),
        func.col('window').end.alias('end'),
        func.col('customer_numbers'))
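For context, here is a minimal end-to-end sketch; the input DataFrame, sample rows, and timestamps are assumptions modeled on the question:

import pyspark.sql.functions as func
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample events, modeled on the question's schema.
df = spark.createDataFrame(
    [("2018-02-04 10:05:00", "c1"),
     ("2018-02-04 10:15:00", "c2"),
     ("2018-02-04 10:40:00", "c3")],
    ["time", "customer_numbers"],
).withColumn("time", func.col("time").cast("timestamp"))

result = (
    df.groupBy(window("time", "30 minutes"))
      .agg(func.countDistinct("customer_numbers").alias("customer_numbers"))
      # The window struct has named fields 'start' and 'end', so selecting
      # them by name splits it into two columns; the struct itself is not
      # carried along because select only returns the listed columns.
      .select(
          func.col("window").start.alias("start"),
          func.col("window").end.alias("end"),
          func.col("customer_numbers"),
      )
)
result.show(truncate=False)

If you don't need to rename the fields, df.select("window.*", "customer_numbers") expands the struct into its start and end columns in one step.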
Ramesh Maharjan