Enumerate blocks of successively equal values in Spark

Question

I want to find the IDs of groups (or blocks) of trues in a Spark DataFrame. That is, I want to go from this:

>>> df.show()
+---------+-----+
|timestamp| bool|
+---------+-----+
|        1|false|
|        2| true|
|        3| true|
|        4|false|
|        5| true|
|        6| true|
|        7| true|
|        8| true|
|        9|false|
|       10|false|
|       11|false|
|       12|false|
|       13|false|
|       14| true|
|       15| true|
|       16| true|
+---------+-----+

to this:

>>> df.show()
+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
|        1|false|    0|
|        2| true|    1|
|        3| true|    1|
|        4|false|    0|
|        5| true|    2|
|        6| true|    2|
|        7| true|    2|
|        8| true|    2|
|        9|false|    0|
|       10|false|    0|
|       11|false|    0|
|       12|false|    0|
|       13|false|    0|
|       14| true|    3|
|       15| true|    3|
|       16| true|    3|
+---------+-----+-----+

(the zeros are optional, could be Null or -1 or whatever is easier to implement)

`DataFrame` has no intrinsic order so there are no "blocks" there. You'll need some order definition, and to make it feasible on large dataset, some grouping variable. If you have both, see the solution linked below: — Alper t. Turker, Feb 09 '18 at 20:43
Possible duplicate of [Pyspark : Custom window function](https://stackoverflow.com/questions/40404060/pyspark-custom-window-function) — Alper t. Turker, Feb 09 '18 at 20:45
Also, please check [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/8371915) — Alper t. Turker, Feb 09 '18 at 20:47

score 1 · Accepted Answer · answered Feb 10 '18 at 20:36

I have a solution in scala, should be easy to adapt it to pyspark. Consider the following dataframe df:

+---------+-----+
|timestamp| bool|
+---------+-----+
|        1|false|
|        2| true|
|        3| true|
|        4|false|
|        5| true|
|        6| true|
|        7| true|
|        8| true|
|        9|false|
|       10|false|
|       11|false|
|       12|false|
|       13|false|
|       14| true|
|       15| true|
|       16| true|
+---------+-----+

then you could do:

df
  .withColumn("prev_bool",lag($"bool",1).over(Window.orderBy($"timestamp")))
  .withColumn("block",sum(when(!$"prev_bool" and $"bool",1).otherwise(0)).over(Window.orderBy($"timestamp")))
  .drop($"prev_bool")
  .withColumn("block",when($"bool",$"block").otherwise(0))
  .show()


+---------+-----+-----+
|timestamp| bool|block|
+---------+-----+-----+
|        1|false|    0|
|        2| true|    1|
|        3| true|    1|
|        4|false|    0|
|        5| true|    2|
|        6| true|    2|
|        7| true|    2|
|        8| true|    2|
|        9|false|    0|
|       10|false|    0|
|       11|false|    0|
|       12|false|    0|
|       13|false|    0|
|       14| true|    3|
|       15| true|    3|
|       16| true|    3|
+---------+-----+-----+

Enumerate blocks of successively equal values in Spark

1 Answers1