Interesting question.
I would try two approaches, depending on how robustly we can determine the centroid of each cluster.
Approach 1
If, as illustrated in your figure, a threshold-based method can be used to cluster the data points, we can first compute the centroids c1, c2, and c3 and treat them as a time series. It should then not be hard to predict the position of the next centroid, c4, using some exponential smoothing. Suppose your original data looks like this:
0 3 4 5      11 12 14     21 23 25     34 37 38     ???
(cluster1)   (cluster2)   (cluster3)   (cluster4)
With a threshold of 5 (the maximum allowed distance between two consecutive data points in the same cluster), we can easily recover the 4 clusters above. The centroid of each cluster is the mean of its points (for example, (21+23+25)/3 = 23), which gives the positions below.
3 12.33 23 36.33 42 (predicted)
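The clustering and centroid step can be sketched in a few lines of Python. This is only a minimal sketch: the function name `cluster_by_gap` is made up for illustration, and it assumes the strikes are 1-D positions that can be sorted.

```python
def cluster_by_gap(points, threshold):
    """Split 1-D points into clusters wherever the gap to the previous point exceeds threshold."""
    points = sorted(points)
    if not points:
        return []
    clusters = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if cur - prev > threshold:
            clusters.append([cur])      # gap too large: start a new cluster
        else:
            clusters[-1].append(cur)    # still inside the current cluster
    return clusters


strikes = [0, 3, 4, 5, 11, 12, 14, 21, 23, 25, 34, 37, 38]
clusters = cluster_by_gap(strikes, threshold=5)
centroids = [sum(c) / len(c) for c in clusters]

print(clusters)
print([round(c, 2) for c in centroids])  # [3.0, 12.33, 23.0, 36.33]
```

The threshold works here because the largest gap inside a cluster is 3 and the smallest gap between clusters is 6. On real data you would have to check that such a separation exists.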
Suppose that, after applying some basic exponential smoothing, you get the next centroid position c4 = 42 (in this toy example it is actually the fifth cluster, but I keep the name c4 from above).
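Because the centroids keep increasing, plain single exponential smoothing would lag behind the trend. One possible choice, which is my assumption and not the only option, is Holt's linear (double) exponential smoothing. The sketch below uses a hypothetical helper `holt_forecast` with arbitrary smoothing parameters `alpha` and `beta`; the number it produces depends on those parameters and will not necessarily be the 42 assumed in the text.

```python
def holt_forecast(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecast using Holt's linear (double) exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]       # initial trend: first difference
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step forecast.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Blend the newly observed slope with the previous trend estimate.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend


centroids = [3.0, 12.33, 23.0, 36.33]
next_centroid = holt_forecast(centroids)
print(round(next_centroid, 2))
```

In practice you would tune `alpha` and `beta` on held-out clusters, or use a library implementation instead of this hand-rolled loop.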
The next step is to predict the positions of the strikes relative to that centroid. We can use all strikes in the previous clusters as our training data. I would start with some simple statistics from the historical data and see how well they work.
The average distance between consecutive strikes within a cluster:
(3+1+1) + (1+2) + (2+2) + (3+1)
------------------------------- = 16/9 ≈ 1.78
               9
and
% of times we have 4 strikes: 25%
% of times we have 3 strikes: 75%
Then we can do a lottery (25% vs 75%) to decide whether we will have 4 strikes or 3 strikes at c4. The relative positions of the strikes can be derived from the average distance. For example, if we have 3 strikes and the centroid is at 42:
strike1: 42-1.78 = 40.22
strike2: 42
strike3: 42+1.78 = 43.78
If we have 4 strikes:
strike1: 42-1.78/2-1.78 = 39.33
strike2: 42-1.78/2 = 41.11
strike3: 42+1.78/2 = 42.89
strike4: 42+1.78/2+1.78 = 44.67
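The statistics, the lottery, and the placement of strikes around the centroid could be sketched as follows. The helper names `strike_stats` and `place_strikes` are made up for illustration, and the fixed random seed is only there so that the sketch is reproducible.

```python
import random
from collections import Counter


def strike_stats(clusters):
    """Return the mean gap between consecutive strikes and the cluster-size frequencies."""
    gaps = [b - a for c in clusters for a, b in zip(c, c[1:])]
    mean_gap = sum(gaps) / len(gaps)
    sizes = Counter(len(c) for c in clusters)
    size_probs = {n: count / len(clusters) for n, count in sizes.items()}
    return mean_gap, size_probs


def place_strikes(centroid, n, gap):
    """Place n strikes, spaced by gap, symmetrically around the centroid."""
    return [centroid + (i - (n - 1) / 2) * gap for i in range(n)]


clusters = [[0, 3, 4, 5], [11, 12, 14], [21, 23, 25], [34, 37, 38]]
mean_gap, size_probs = strike_stats(clusters)

rng = random.Random(0)                         # fixed seed, for reproducibility only
sizes, weights = zip(*size_probs.items())
n = rng.choices(sizes, weights=weights)[0]     # the "lottery" over cluster sizes
predicted = place_strikes(42, n, mean_gap)
print(n, [round(p, 2) for p in predicted])
```

For odd `n` the middle strike sits exactly on the centroid; for even `n` the two middle strikes sit half a gap to either side, which matches the hand calculation.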
Note: We will update our statistics once we observe the actual data points in c4, so we are always able to correct our predictions. We will also update the exponential smoothing for the centroid prediction as soon as we have observed the actual data points in c4. Now we are ready to predict c5 and the strikes in c5, and so on.
Approach 2
If we do not have a robust way to do the clustering, I would try linear regression with some feature engineering. The target variable is the x-position of the next strike. The features could include:
- The previous k points
- The k-1 distances between every two consecutive points
- The moving average of the most recent 3 points
- The moving average of the most recent 5 points
- ...
- The moving average of the most recent 9 points
- The n largest distances between two consecutive points (to measure inter-cluster distances)
- The n smallest distances between two consecutive points (to measure intra-cluster distances)
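A minimal sketch of how such feature rows might be built is shown below. The function `build_features` and its parameters `k` and `n` are hypothetical, and I am assuming the moving-average windows are 3, 5, 7, and 9 (skipping any window longer than `k`). The resulting rows can be passed to any linear regression routine.

```python
def build_features(points, k=9, n=2):
    """Build (features, target) rows: the k points before index t predict points[t]."""
    rows = []
    for t in range(k, len(points)):
        window = list(points[t - k:t])
        gaps = [b - a for a, b in zip(window, window[1:])]
        sorted_gaps = sorted(gaps)
        features = (
            window                                       # the previous k points
            + gaps                                       # the k-1 consecutive distances
            + [sum(window[-m:]) / m                      # moving averages (3, 5, 7, 9)
               for m in range(3, 10, 2) if m <= k]
            + sorted_gaps[-n:]                           # n largest gaps (inter-cluster)
            + sorted_gaps[:n]                            # n smallest gaps (intra-cluster)
        )
        rows.append((features, points[t]))
    return rows


strikes = [0, 3, 4, 5, 11, 12, 14, 21, 23, 25, 34, 37, 38]
rows = build_features(strikes, k=5, n=2)
print(len(rows), rows[0])
```

With only 13 points this yields very few training rows, so a real fit would need a much longer history than this toy series.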
Use your imagination and continue. Hope it helps.