Sensors (of the same type) scattered on my site report to my backend at irregular intervals. Between reports, each sensor aggregates events and reports them as a batch.
The following dataset is a collection of these batched event sequences. For example, sensor 1 reported twice: 2 events in the first batch and 3 events in the second, while sensor 2 reported once with 3 events.
I would like to use this data as my training data X:
sensor_id | batch_id | timestamp | feature_1 | feature_n |
---|---|---|---|---|
1 | 1 | 2020-12-21T00:00:00+00:00 | 0.54 | 0.33 |
1 | 1 | 2020-12-21T01:00:00+00:00 | 0.23 | 0.14 |
1 | 2 | 2020-12-21T03:00:00+00:00 | 0.51 | 0.13 |
1 | 2 | 2020-12-21T04:00:00+00:00 | 0.23 | 0.24 |
1 | 2 | 2020-12-21T05:00:00+00:00 | 0.33 | 0.44 |
2 | 1 | 2020-12-21T00:00:00+00:00 | 0.54 | 0.33 |
2 | 1 | 2020-12-21T01:00:00+00:00 | 0.23 | 0.14 |
2 | 1 | 2020-12-21T03:00:00+00:00 | 0.51 | 0.13 |
My target y is a score calculated from all the events collected by a sensor:
i.e. score_sensor_1 = f([[batch1...],[batch2...]])
sensor_id | final_score |
---|---|
1 | 0.8 |
2 | 0.6 |
I would like to predict y each time a batch is collected, i.e. 2 predictions for a sensor with 2 reports.
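To make the pairing concrete, here is a minimal sketch (using hypothetical toy data mirroring the tables above) of how I build one training sample per collected batch: each sample is all events up to and including that batch, and the target is the sensor's final score:

```python
import pandas as pd

# Toy frame mirroring the X table above (values are illustrative).
X = pd.DataFrame({
    "sensor_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "batch_id":  [1, 1, 2, 2, 2, 1, 1, 1],
    "feature_1": [0.54, 0.23, 0.51, 0.23, 0.33, 0.54, 0.23, 0.51],
    "feature_n": [0.33, 0.14, 0.13, 0.24, 0.44, 0.33, 0.14, 0.13],
})
y = {1: 0.8, 2: 0.6}  # final_score per sensor

samples, targets = [], []
for sensor_id, g in X.groupby("sensor_id"):
    for batch_id in sorted(g["batch_id"].unique()):
        # One training sample per batch: all events up to and including it.
        prefix = g[g["batch_id"] <= batch_id][["feature_1", "feature_n"]].values
        samples.append(prefix)
        targets.append(y[sensor_id])

# Sensor 1 contributes 2 samples (lengths 2 and 5), sensor 2 one sample (length 3).
```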
LSTM model:
I started with an LSTM model, since I'm trying to predict on a time series of events.
My first thought was to choose a fixed input length, zero-pad the input when the number of collected events is smaller than that length, and then mask the padded values:
model.add(Masking(mask_value=0., input_shape=(max_timesteps, num_features)))  # input_shape excludes the batch dimension: (timesteps, features)
For example:
sensor_id | batch_id | timestamp | feature_1 | feature_n |
---|---|---|---|---|
1 | 1 | 2020-12-21T00:00:00+00:00 | 0.54 | 0.33 |
1 | 1 | 2020-12-21T01:00:00+00:00 | 0.23 | 0.14 |
Would produce the following input if the selected length is 5:
[
[0.54, 0.33],
[0.23, 0.14],
[0,0],
[0,0],
[0,0]
]
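Putting the padding and masking together, this is a minimal sketch of the approach (names like `max_len` are my own; padding done with `keras.utils.pad_sequences`):

```python
import numpy as np
from tensorflow import keras

max_len, num_features = 5, 2

# One sequence with 2 events, as in the example above.
events = [np.array([[0.54, 0.33], [0.23, 0.14]])]

# Zero-pad at the end so every sample has max_len timesteps.
X = keras.utils.pad_sequences(events, maxlen=max_len,
                              dtype="float32", padding="post")

model = keras.Sequential([
    keras.Input(shape=(max_len, num_features)),
    keras.layers.Masking(mask_value=0.0),  # padded timesteps are skipped
    keras.layers.LSTM(32),
    keras.layers.Dense(1),                 # predicted score
])
model.compile(optimizer="adam", loss="mse")
```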
However, the number of events per report varies widely in my training data: one report could collect 1000 events while another collects only 10. So if I select the average size (say, 200), some inputs would be mostly padding, while others would be truncated and data would be lost.
I've heard about ragged tensors, but I'm not sure they fit my use case. How would one approach such a problem?
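For context, this is my understanding of what a ragged tensor would look like for this data (a plain TensorFlow sketch, no model attached):

```python
import tensorflow as tf

# A ragged tensor holds one variable-length event sequence per report,
# with no padding and no truncation.
events = tf.ragged.constant([
    [[0.54, 0.33], [0.23, 0.14]],                # report with 2 events
    [[0.51, 0.13], [0.23, 0.24], [0.33, 0.44]],  # report with 3 events
])

lengths = events.row_lengths()  # number of events per report: [2, 3]

# .to_tensor() zero-pads to the longest sequence, recovering
# the fixed-size padded form described above.
dense = events.to_tensor()      # shape (2, 3, 2)
```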