
I am looking into generating two groups under the same ID, ordered by timestamp. The input is like this:

[image: input table]

The expected output is:

[image: expected output table]

I tried to use dense_rank(), but it does not follow the timestamp order. Any help is appreciated!

Saxia

1 Answer


A quick solution is to use the timestamp value: compute the min, max, and median per ID, then use the median as the dividing point to populate the two groups. This divides each ID into two roughly equal halves based on timestamp.

Example -


from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

input_list = [
    (1, None, 111),
    (1, None, 120),
    (1, None, 121),
    (1, None, 124),
    (1, 'p1', 125),
    (1, None, 126),
    (1, None, 146),
    (1, None, 147),
]

sparkDF = spark.createDataFrame(input_list, ['id', 'p_id', 'timestamp'])

# Per-ID window: min, median, and max are computed over all rows sharing an id
grp_window = Window.partitionBy('id')
median_percentile = F.expr('percentile_approx(timestamp, 0.5)')

sparkDF = sparkDF.withColumn('min_timestamp', F.min('timestamp').over(grp_window))\
                 .withColumn('median_value', median_percentile.over(grp_window))\
                 .withColumn('max_timestamp', F.max('timestamp').over(grp_window))

# Rows at or below the median (+1 keeps the boundary row in the first half) -> 'p1'
cond = (
    (F.col('min_timestamp') <= F.col('timestamp'))
    & (F.col('timestamp') <= F.col('median_value') + 1)
)

sparkDF = sparkDF.withColumn('p_id', F.when(cond, 'p1').otherwise('p2'))

sparkDF.show()

+---+----+---------+-------------+------------+-------------+
| id|p_id|timestamp|min_timestamp|median_value|max_timestamp|
+---+----+---------+-------------+------------+-------------+
|  1|  p1|      111|          111|         124|          147|
|  1|  p1|      120|          111|         124|          147|
|  1|  p1|      121|          111|         124|          147|
|  1|  p1|      124|          111|         124|          147|
|  1|  p1|      125|          111|         124|          147|
|  1|  p2|      126|          111|         124|          147|
|  1|  p2|      146|          111|         124|          147|
|  1|  p2|      147|          111|         124|          147|
+---+----+---------+-------------+------------+-------------+
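The same median-split idea can be sketched in plain Python, which may make the condition easier to follow (this mirrors the Spark logic per ID; `split_by_median` is an illustrative helper name, not part of any library, and the lower-middle element is used to approximate `percentile_approx(…, 0.5)`):

```python
def split_by_median(timestamps):
    """Assign 'p1' to timestamps at or below the median (+1 for the
    boundary row), 'p2' to the rest — mirroring the Spark condition."""
    ordered = sorted(timestamps)
    # Lower middle element, roughly what percentile_approx(ts, 0.5) returns
    median = ordered[(len(ordered) - 1) // 2]
    return {ts: ('p1' if ts <= median + 1 else 'p2') for ts in timestamps}

groups = split_by_median([111, 120, 121, 124, 125, 126, 146, 147])
# 111..125 fall into 'p1', 126..147 into 'p2'
```

For exactly equal halves regardless of value gaps, `F.ntile(2)` over a window ordered by `timestamp` (partitioned by `id`) is another option worth considering.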

Vaebhav
  • Thank you for your quick answer! The example I showed might be a little bit confusing, so I have now changed it to a more common one. Thanks! – Saxia Aug 13 '21 at 00:16
  • Updated the answer with your sample data. I understand this might not be the most elegant solution, but the idea was to provide an approach to the problem. – Vaebhav Aug 13 '21 at 15:34