1

I want to find the IDs of groups (or blocks) of trues in a Spark DataFrame. That is, I want to go from this:

>>> df.show()
+---------+-----+
|timestamp| bool|
+---------+-----+
|        1|false|
|        2| true|
|        3| true|
|        4|false|
|        5| true|
|        6| true|
|        7| true|
|        8| true|
|        9|false|
|       10|false|
|       11|false|
|       12|false|
|       13|false|
|       14| true|
|       15| true|
|       16| true|
+---------+-----+

to this:

>>> df.show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
|        1|false|    0|
|        2| true|    1|
|        3| true|    1|
|        4|false|    0|
|        5| true|    2|
|        6| true|    2|
|        7| true|    2|
|        8| true|    2|
|        9|false|    0|
|       10|false|    0|
|       11|false|    0|
|       12|false|    0|
|       13|false|    0|
|       14| true|    3|
|       15| true|    3|
|       16| true|    3|
+---------+-----+-----+

(the zeros are optional, could be Null or -1 or whatever is easier to implement)

prcastro
  • 2,178
  • 3
  • 19
  • 21

1 Answers1

1

I have a solution in scala, should be easy to adapt it to pyspark. Consider the following dataframe df:

+---------+-----+
|timestamp| bool|
+---------+-----+
|        1|false|
|        2| true|
|        3| true|
|        4|false|
|        5| true|
|        6| true|
|        7| true|
|        8| true|
|        9|false|
|       10|false|
|       11|false|
|       12|false|
|       13|false|
|       14| true|
|       15| true|
|       16| true|
+---------+-----+

then you could do:

df
  .withColumn("prev_bool",lag($"bool",1).over(Window.orderBy($"timestamp")))
  .withColumn("block",sum(when(!$"prev_bool" and $"bool",1).otherwise(0)).over(Window.orderBy($"timestamp")))
  .drop($"prev_bool")
  .withColumn("block",when($"bool",$"block").otherwise(0))
  .show()


+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
|        1|false|    0|
|        2| true|    1|
|        3| true|    1|
|        4|false|    0|
|        5| true|    2|
|        6| true|    2|
|        7| true|    2|
|        8| true|    2|
|        9|false|    0|
|       10|false|    0|
|       11|false|    0|
|       12|false|    0|
|       13|false|    0|
|       14| true|    3|
|       15| true|    3|
|       16| true|    3|
+---------+-----+-----+
Raphael Roth
  • 26,751
  • 15
  • 88
  • 145