I have a column of binary sensor data. I want to identify consecutive sequences of 1s, which denote an event occurring, and also get the time interval each event lasted. Is this doable with Spark? Here is an example of the data I'm working with.
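(Simplified illustration with made-up column names `ts` and `value`; the real data is a timestamp plus a 0/1 sensor reading. Here `ts` 1-3 would be one event and 6-7 another.)

```
ts  value
0   0
1   1
2   1
3   1
4   0
5   0
6   1
7   1
```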
I would be able to do this if I could go through the data row by row, but for that I would need to do a collect() first, and then all my data would be on a single machine. One idea I had: is there a way to keep the data on the worker nodes, run an iterative algorithm on each partition to generate the event information, and then bring the results back to the driver? I also read that there is something called Structured Streaming in Spark 2.2, but I'm not sure whether that is what I'm looking for.
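To make the row-by-row idea concrete, here is a rough sketch of the kind of result I'm after, using window functions (the "gaps and islands" trick) against the made-up `ts`/`value` columns from the sample above. Note that an unpartitioned window pulls all rows into one partition, so with real data it would presumably need a partition key such as a sensor id:

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up data: integer timestamps and a 0/1 sensor value.
df = spark.createDataFrame(
    [(0, 0), (1, 1), (2, 1), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1)],
    ["ts", "value"],
)

# NOTE: an unpartitioned window moves all rows into one partition;
# with real data you would partition by a device/sensor id column.
w = Window.orderBy("ts")

events = (
    df
    # A new run starts wherever the value differs from the previous row.
    .withColumn("prev", F.lag("value").over(w))
    .withColumn(
        "run_start",
        (F.col("prev").isNull() | (F.col("value") != F.col("prev"))).cast("int"),
    )
    # A running sum of the start flags labels each run with its own id.
    .withColumn("run_id", F.sum("run_start").over(w))
    # Keep only the runs of 1s and collapse each one to start/end/duration.
    .filter(F.col("value") == 1)
    .groupBy("run_id")
    .agg(
        F.min("ts").alias("start"),
        F.max("ts").alias("end"),
        (F.max("ts") - F.min("ts")).alias("duration"),
    )
)

events.show()
```

This should give one row per event with its start, end, and duration, without collecting anything to the driver. But I don't know if this is the right way to do it at scale.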
Any other ideas are welcome.
FYI, I'm working with pyspark, and I'm very new to this.