[Image]

The first line represents a 1D timeline. The small strikes are data points.

The second line shows the strikes grouped into clusters, with each cluster's centroid marked at its center.

Which methods can be used to predict where the next cluster's strikes and centroid will be, as shown in the third image?

I found these questions:

If it helps at all: I cannot supply training data, only historical data, as it might change (perhaps historical data can be used as training data?).

basickarl

1 Answer

Interesting question.

I would try two approaches, depending on how robustly we can determine the centroid of each cluster.

Approach 1

If, as illustrated in your figure, a threshold-based method can be used to cluster the data points, we can first compute the centroids c1, c2, and c3 and treat the sequence of centroids as a time series. It should then not be hard to predict the position of the next centroid, c4, using some form of exponential smoothing. Suppose your original data looks like this:

0 3 4 5     11 12 14     21 23 25     34 37 38       ???
(cluster1)  (cluster2)  (cluster3)   (cluster4)      

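If we use a threshold of 5 (the maximum allowed distance between two consecutive data points in the same cluster), we can easily identify the 4 clusters above: every gap inside a cluster is at most 3, and every gap between clusters is at least 6. We can then compute the centroid positions below (for example, (21+23+25)/3 = 23).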

 3            12.33         23          36.33        42 (predicted)
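The thresholding and centroid steps can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names `threshold_clusters` and `centroids` are hypothetical, and it assumes a new cluster starts whenever the gap between consecutive points exceeds the threshold.

```python
def threshold_clusters(points, threshold):
    """Split 1D points into clusters wherever a gap exceeds `threshold`."""
    points = sorted(points)
    if not points:
        return []
    clusters = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if cur - prev > threshold:
            clusters.append([cur])    # gap too large: start a new cluster
        else:
            clusters[-1].append(cur)  # same cluster as the previous point
    return clusters


def centroids(clusters):
    """Mean position of each cluster."""
    return [sum(c) / len(c) for c in clusters]


points = [0, 3, 4, 5, 11, 12, 14, 21, 23, 25, 34, 37, 38]
clusters = threshold_clusters(points, threshold=5)
print(clusters)
print([round(c, 2) for c in centroids(clusters)])  # [3.0, 12.33, 23.0, 36.33]
```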

Suppose that, after applying some basic exponential smoothing, you get 42 as the next centroid position (the ??? above, referred to below as c4, as in your figure).
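One way to sketch the smoothing step in Python is shown below. It rests on an assumption that is not part of the answer: because the centroids trend upward, the sketch applies simple exponential smoothing to the gaps between consecutive centroids and adds the smoothed gap to the last centroid, rather than smoothing the positions directly. The names `exp_smooth` and `predict_next_centroid` and the value `alpha=0.5` are illustrative. With these choices the example data gives 48, not the 42 that the answer merely supposes.

```python
def exp_smooth(values, alpha):
    """Simple exponential smoothing; returns the final smoothed level."""
    level = values[0]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level


def predict_next_centroid(centroids, alpha=0.5):
    """Last centroid plus the exponentially smoothed inter-centroid gap."""
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    gaps = [b - a for a, b in zip(centroids, centroids[1:])]
    return centroids[-1] + exp_smooth(gaps, alpha)


observed = [3.0, 37 / 3, 23.0, 109 / 3]  # centroids of the four clusters
print(round(predict_next_centroid(observed), 2))  # 48.0
```

A larger `alpha` weights the most recent gap more heavily; a smaller one averages over more history.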

The next step is to predict the positions of the strikes relative to the centroid. We can use all strikes in the previous clusters as our training data. I would start by computing some simple statistics from the historical data and see how well they work.

The average distance between consecutive strikes within a cluster:

(3+1+1) + (1+2) + (2+2) + (3+1)
-------------------------------  = 16/9 ≈ 1.78
               9

and

% of times we have 4 strikes: 25%
% of times we have 3 strikes: 75%

Then we can do a lottery (25% vs 75%) to decide whether we will have 4 strikes or 3 strikes at c4. The relative positions of the strikes can be derived from the average-distance statistic. For example, if we have 3 strikes and the centroid is at 42:

strike1: 42-1.78 = 40.22
strike2: 42
strike3: 42+1.78 = 43.78

If we have 4 strikes:

strike1: 42-1.78/2-1.78 = 39.33
strike2: 42-1.78/2      = 41.11
strike3: 42+1.78/2      = 42.89
strike4: 42+1.78/2+1.78 = 44.67
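The statistics, the lottery, and the strike placement can be sketched in Python as below. The helper names `strike_stats`, `sample_size`, and `predict_strikes` are hypothetical, and the sketch assumes strikes are spaced evenly by the average gap and centered on the predicted centroid. For the example clusters, the intra-cluster gaps sum to 16 over 9 gaps, so the average gap evaluates to 16/9 ≈ 1.78.

```python
import random
from collections import Counter


def strike_stats(clusters):
    """Mean gap between consecutive strikes, and a count of cluster sizes."""
    gaps = [b - a for c in clusters for a, b in zip(c, c[1:])]
    mean_gap = sum(gaps) / len(gaps)
    size_counts = Counter(len(c) for c in clusters)
    return mean_gap, size_counts


def sample_size(size_counts, rng=random):
    """The 'lottery': draw a strike count in proportion to past frequency."""
    sizes = list(size_counts)
    weights = [size_counts[s] for s in sizes]
    return rng.choices(sizes, weights=weights, k=1)[0]


def predict_strikes(centroid, n, gap):
    """Place n strikes, spaced by `gap`, symmetrically around `centroid`."""
    offset = (n - 1) / 2
    return [centroid + (i - offset) * gap for i in range(n)]


clusters = [[0, 3, 4, 5], [11, 12, 14], [21, 23, 25], [34, 37, 38]]
mean_gap, size_counts = strike_stats(clusters)
n = sample_size(size_counts)  # 3 with probability 0.75, 4 with 0.25
print(round(mean_gap, 2), n, predict_strikes(42, n, mean_gap))
```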

Note: Once we observe the actual data points in c4, we update both the strike statistics and the exponential smoothing used for the centroid prediction, so the predictions keep correcting themselves. We are then ready to predict c5 and the strikes in c5, and so on.

Approach 2

If we do not have a robust way to do the clustering, I would try linear regression with some feature engineering. The target variable is the x-position of the next strike. The features we can use could include:

  • The previous k points
  • The k-1 distances between every two consecutive points
  • The moving average of the most recent 3 points
  • The moving average of the most recent 5 points
  • ...
  • The moving average of the most recent 9 points
  • The n largest distances between two consecutive points (to measure inter-cluster distances)
  • The n smallest distances between two consecutive points (to measure intra-cluster distances)
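A few of the features listed above can be built in plain Python as sketched here. The helpers `build_features` and `training_rows` are hypothetical, and the choices `k=5`, `n=2`, and moving-average windows of 3 and 5 are only illustrative. Each row pairs a feature vector computed from the history up to a point with the position of the next strike as the target; the rows can then be passed to any least-squares linear regression routine.

```python
def build_features(history, k=5, n=2, windows=(3, 5)):
    """Feature vector from the k most recent points of `history`."""
    if len(history) < k:
        raise ValueError("need at least k points")
    recent = list(history[-k:])
    gaps = [b - a for a, b in zip(recent, recent[1:])]  # k-1 distances
    features = recent + gaps
    for w in windows:
        if w <= k:
            features.append(sum(recent[-w:]) / w)  # moving average
    ordered = sorted(gaps)
    features += ordered[-n:]  # n largest gaps (inter-cluster scale)
    features += ordered[:n]   # n smallest gaps (intra-cluster scale)
    return features


def training_rows(points, k=5, n=2):
    """(features, target) pairs; the target is the next strike position."""
    return [(build_features(points[:i], k, n), points[i])
            for i in range(k, len(points))]


points = [0, 3, 4, 5, 11, 12, 14, 21, 23, 25, 34, 37, 38]
rows = training_rows(points)
print(len(rows), rows[0])
```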

Use your imagination and continue. Hope it helps.

greeness
  • I have actually done two hierarchical clusterings to remove noise from this 1D line. The original data has 5 features; this is the last piece of the puzzle! I suppose you could call that feature engineering. You are correct, thresholding via single-linkage hierarchical clustering will get me the centroids, so that's not an issue. I will try your first approach; however, it will not work if the pattern looks different, for example: 1 2 4 5 7 8 etc. That is not the question, however :) – basickarl Apr 08 '15 at 09:02