
I am looking into generating two groups under the same ID, ordered by timestamp. The input is like this:

[image: input table]

The expected output is:

[image: expected output table]

I tried to use dense_rank(), but it does not follow the timestamp order. Any help is appreciated!

Saxia

1 Answer


A quick solution is to use the timestamp value: compute the min, max, and median per ID, then use the median as the dividing point to populate the two groups. This divides each ID into two roughly equal halves based on timestamp.

Example -


from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

input_list = [
    (1, None, 111),
    (1, None, 120),
    (1, None, 121),
    (1, None, 124),
    (1, 'p1', 125),
    (1, None, 126),
    (1, None, 146),
    (1, None, 147),
]

sparkDF = spark.createDataFrame(input_list, ['id', 'p_id', 'timestamp'])

# Per-ID window: min, median, and max are computed over all rows sharing an id
grp_window = Window.partitionBy('id')
median_percentile = F.expr('percentile_approx(timestamp, 0.5)')

sparkDF = sparkDF.withColumn('min_timestamp', F.min('timestamp').over(grp_window))\
                 .withColumn('median_value', median_percentile.over(grp_window))\
                 .withColumn('max_timestamp', F.max('timestamp').over(grp_window))

# Rows at or below the median (+1 keeps the boundary row in the first half) -> 'p1'
cond = (
    (F.col('min_timestamp') <= F.col('timestamp'))
    & (F.col('timestamp') <= F.col('median_value') + 1)
)

sparkDF = sparkDF.withColumn('p_id', F.when(cond, 'p1').otherwise('p2'))

sparkDF.show()

+---+----+---------+-------------+------------+-------------+
| id|p_id|timestamp|min_timestamp|median_value|max_timestamp|
+---+----+---------+-------------+------------+-------------+
|  1|  p1|      111|          111|         124|          147|
|  1|  p1|      120|          111|         124|          147|
|  1|  p1|      121|          111|         124|          147|
|  1|  p1|      124|          111|         124|          147|
|  1|  p1|      125|          111|         124|          147|
|  1|  p2|      126|          111|         124|          147|
|  1|  p2|      146|          111|         124|          147|
|  1|  p2|      147|          111|         124|          147|
+---+----+---------+-------------+------------+-------------+
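The same median-split idea can be sketched in plain Python, which may make the condition easier to follow (this mirrors the Spark logic per ID; `split_by_median` is an illustrative helper name, not part of any library, and the lower-middle element is used to approximate `percentile_approx(…, 0.5)`):

```python
def split_by_median(timestamps):
    """Assign 'p1' to timestamps at or below the median (+1 for the
    boundary row), 'p2' to the rest — mirroring the Spark condition."""
    ordered = sorted(timestamps)
    # Lower middle element, roughly what percentile_approx(ts, 0.5) returns
    median = ordered[(len(ordered) - 1) // 2]
    return {ts: ('p1' if ts <= median + 1 else 'p2') for ts in timestamps}

groups = split_by_median([111, 120, 121, 124, 125, 126, 146, 147])
# 111..125 fall into 'p1', 126..147 into 'p2'
```

For exactly equal halves regardless of value gaps, `F.ntile(2)` over a window ordered by `timestamp` (partitioned by `id`) is another option worth considering.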

Vaebhav
  • Thank you for your quick answer! The example I showed might be a little bit confusing, so I have now changed it to a more common one. Thanks! – Saxia Aug 13 '21 at 00:16
  • Updated the answer with your sample data. I understand this might not be the most elegant solution, but the idea was to provide an approach to the problem. – Vaebhav Aug 13 '21 at 15:34