I'm adding multi-head attention at the input of my CNN to improve the interpretability and explainability of the model. The data is a 3D time series of shape (125, 5, 6), where the 5x6 part is a single sample and 125 is the number of ordered samples (basically a 1-second recording of 5x6 time-series data).

While reading about transformers for time-series and video data, I came across positional embedding. From my understanding, positional embedding ties the data to its position in the sequence, so the model does not treat the sequence as an unordered set, which would be wrong for data where order is crucial. What I don't understand is whether my model needs positional embedding at all: I already feed the data in order as (125, 5, 6), and separate recordings have no connection to each other, but the order of the 125 samples within a recording must be preserved.
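To make it concrete, this is roughly what I understand a (learned) positional embedding to be for this input, written as a small TensorFlow/Keras sketch. Flattening each 5x6 sample to 30 features, the layer name, and the initializer are illustrative choices only, not part of my actual model:

```python
import tensorflow as tf
from tensorflow.keras import layers

class AddPositionEmbedding(layers.Layer):
    """Adds one trainable vector per time step (my understanding of a
    learned positional embedding; name/initializer are placeholders)."""
    def build(self, input_shape):
        # input is expected as (batch, 125, features)
        self.pos_emb = self.add_weight(
            name="pos_emb",
            shape=(input_shape[1], input_shape[2]),
            initializer="random_normal",
            trainable=True,
        )

    def call(self, x):
        # broadcasting adds the same positional vector to every batch element
        return x + self.pos_emb

# each of the 125 samples flattened from 5x6 to 30 features before attention
seq = layers.Input(shape=(125, 5, 6))
x = layers.Reshape((125, 5 * 6))(seq)   # (batch, 125, 30)
x = AddPositionEmbedding()(x)           # order information injected here
```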
Essentially, I'm not sure whether the addition to my model should be:

1. Multi-head attention -> Dropout -> Normalization -> the developed CNN
2. Positional embedding -> Multi-head attention -> Dropout -> Normalization -> the developed CNN (sketched below)
3. Multi-head attention -> the developed CNN
4. Positional embedding -> Multi-head attention -> the developed CNN

or something completely different.
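For reference, here is a rough sketch of what I picture option 2 looking like in TensorFlow/Keras. The num_heads, key_dim, dropout rate, and the tiny TimeDistributed Conv2D stand-in for "the developed CNN" are all placeholders, not my real architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, H, W = 125, 5, 6
D = H * W                                # 30 features per time step

class AddPositionEmbedding(layers.Layer):
    """One trainable positional vector per time step (same as the sketch
    above, repeated so this snippet runs on its own)."""
    def build(self, input_shape):
        self.pos_emb = self.add_weight(
            name="pos_emb", shape=(input_shape[1], input_shape[2]),
            initializer="random_normal", trainable=True)
    def call(self, x):
        return x + self.pos_emb

inputs = layers.Input(shape=(SEQ_LEN, H, W))
x = layers.Reshape((SEQ_LEN, D))(inputs)                    # (batch, 125, 30)
x = AddPositionEmbedding()(x)                               # positional embedding
x = layers.MultiHeadAttention(num_heads=4, key_dim=D)(x, x) # self-attention over the 125 steps
x = layers.Dropout(0.1)(x)                                  # dropout
x = layers.LayerNormalization()(x)                          # normalization
x = layers.Reshape((SEQ_LEN, H, W, 1))(x)                   # back to (125, 5, 6, 1) for the CNN

# ----- stand-in for "the developed CNN" -----
x = layers.TimeDistributed(
    layers.Conv2D(8, (3, 3), padding="same", activation="relu"))(x)
x = layers.GlobalAveragePooling3D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)          # placeholder head

model = Model(inputs, outputs)
model.summary()
```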
Another question: do I need more than a single layer of attention, considering that I'm adding it solely for the purpose of improving interpretability and explainability?
P.S. I'm aware that the data needs to be reshaped to (125, 5, 6, 1) to be able to use Transformers.
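(The reshape I mean is just adding a trailing channel axis, e.g. with NumPy on a random stand-in for one recording:)

```python
import numpy as np

recording = np.random.rand(125, 5, 6)   # stand-in for one 1-second recording
recording = recording[..., np.newaxis]  # -> (125, 5, 6, 1)
print(recording.shape)
```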