
I'm trying to cluster time series with scikit-learn's OPTICS. The documentation says that the input array X should have shape (n_samples, n_features), but my array has shape (n_samples, n_time_stamps, n_features). An example of the layout is shown further down.

My question is how I can call the fit method of OPTICS on time series data. I know that people have used OPTICS and DBSCAN with time series; I just can't figure out how they implemented it. Any help would be much appreciated.

[[[t00, x00], [t01, x01], ... [t0_n_timestamps, x0_n_timestamps]],
 [[t10, x10], [t11, x11], ... [t1_n_timestamps, x1_n_timestamps]],
 .
 .
 .
 [[t_n_samples_0, x_n_samples_0], [t_n_samples_1, x_n_samples_1], ... [t_n_samples_n_timestamps, x_n_samples_n_timestamps]]]
Eirik
  • Did you manage to solve this? Basically, you have to reshape your data so that you have n rows, with each value of the time series in a separate column. Then take these columns, convert them to an array (e.g. with to_numpy()), and pass that to the fit function. – SkogensKonung Mar 27 '22 at 08:33
  • No, I did not. Am I understanding you correctly that I should change the data into the following form: [[value11, value12, ... value_1n], [value21, value22, ... value_2n], . . . [valuem1, valuem2, ... value_mn]], where each time series has n values and there are m time series? – Eirik Mar 28 '22 at 10:28
  • Assuming that you had 3 time series, and each had three values, does the following conform to the format of your data? `data = [ [["00:00", 7], ["00:01", 37], ["00:02", 3]], [["00:00", 27], ["00:01", 137], ["00:02", 33]], [["00:00", 14], ["00:01", 17], ["00:02", 12]]]` – SkogensKonung Mar 28 '22 at 12:25
  • yes that's correct – Eirik Mar 28 '22 at 13:24
  • Is your input data a numpy array or is it a list? How is time expressed in your data? I'm asking because if it is a numpy array, then it is much easier to try to transform it to a dataframe. If it is a regular python list, for example as the one I presented, then it is more challenging. – SkogensKonung Mar 28 '22 at 17:50
  • It is a 2D numpy array with time and values. Like this np.array([[t1, v1], [t2,v2], ... [tn,vn]]). – Eirik Mar 30 '22 at 09:14

1 Answer


Given the following np.array as an input:

import numpy as np

data = np.array([
    [["00:00", 7], ["00:01", 37], ["00:02", 3]],
    [["00:00", 27], ["00:01", 137], ["00:02", 33]],
    [["00:00", 14], ["00:01", 17], ["00:02", 12]],
    [["00:00", 15], ["00:01", 123], ["00:02", 11]],
    [["00:00", 16], ["00:01", 12], ["00:02", 92]],
    [["00:00", 17], ["00:01", 23], ["00:02", 22]],
    [["00:00", 18], ["00:01", 23], ["00:02", 112]],
    [["00:00", 100], ["00:01", 200], ["00:02", 301]],
    [["00:00", 101], ["00:01", 201], ["00:02", 302]],
    [["00:00", 102], ["00:01", 203], ["00:02", 303]],
    [["00:00", 104], ["00:01", 207], ["00:02", 304]]])

I will proceed as follows:

    import pandas as pd
    from sklearn.cluster import OPTICS

    # save shape info in three separate variables
    x, y, z = data.shape
    # prepend the series number to every [time, value] row
    # idea from https://stackoverflow.com/a/36235454/5050691
    output_arr = np.column_stack((np.repeat(np.arange(x), y), data.reshape(x * y, -1)))
    # create a df out of the arr; every entry is a string because `data` mixes strings and ints
    df = pd.DataFrame(output_arr)
    # rename for understandability
    df = df.rename(columns={0: 'index', 1: 'time', 2: 'value'})
    # Change the orientation between rows and columns so that rows
    # that contain time info become columns.
    # Note: 'index' holds strings here, so the pivoted rows sort as '0', '1', '10', '2', ...
    df = df.pivot(index="index", columns="time", values="value")
    # drop the "time" axis name and turn the index back into a regular column
    df = df.rename_axis(None, axis=1).reset_index()
    # get columns that refer to specific interval of time series
    temporal_accessors = ["00:00", "00:01", "00:02"]
    # extract data that will be used to carry out clustering (strings converted to floats)
    data_for_clustering = df[temporal_accessors].to_numpy().astype(float)

    # a set of exemplary params
    params = {
        "xi": 0.05,
        "metric": "euclidean",
        "min_samples": 3
    }
    clusterer = OPTICS(**params)
    fitted = clusterer.fit(data_for_clustering)
    cluster_labels = fitted.labels_
    df["cluster"] = cluster_labels
    # Note: density-based algorithms have a notion of a "noise cluster", which sklearn marks
    # with -1. That's why labels start at -1 for density-based clustering,
    # and at 0 otherwise.

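For the given data and the presented choice of params, you'll get the following cluster labels: [0 0 1 0 0 0 0 0 1 1 1]. The labels follow the row order of the pivoted DataFrame. Because the series index is stored as a string there, the rows sort as 0, 1, 10, 2, ..., 9, so the 1 in third position belongs to series 10, and series 7 to 10 (the ones with values in the hundreds) form cluster 1.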
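If all series share the same timestamps, the DataFrame detour is probably not needed. As a minimal sketch under that assumption, the value column can be sliced out with NumPy, which gives the (n_samples, n_features) array that fit expects and keeps the rows in their original order. The names `values`, `n_noise` and `n_clusters` are only illustrative, and the labels are not guaranteed to be numbered the same way as above.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Same toy data as above: shape (n_samples, n_timestamps, 2) with [time, value] pairs.
data = np.array([
    [["00:00", 7], ["00:01", 37], ["00:02", 3]],
    [["00:00", 27], ["00:01", 137], ["00:02", 33]],
    [["00:00", 14], ["00:01", 17], ["00:02", 12]],
    [["00:00", 15], ["00:01", 123], ["00:02", 11]],
    [["00:00", 16], ["00:01", 12], ["00:02", 92]],
    [["00:00", 17], ["00:01", 23], ["00:02", 22]],
    [["00:00", 18], ["00:01", 23], ["00:02", 112]],
    [["00:00", 100], ["00:01", 200], ["00:02", 301]],
    [["00:00", 101], ["00:01", 201], ["00:02", 302]],
    [["00:00", 102], ["00:01", 203], ["00:02", 303]],
    [["00:00", 104], ["00:01", 207], ["00:02", 304]]])

# Assumption: every series is sampled at the same timestamps, so the time
# column can be dropped and each timestamp's value becomes one feature.
values = data[:, :, 1].astype(float)  # shape (n_samples, n_timestamps)

clusterer = OPTICS(min_samples=3, xi=0.05, metric="euclidean")
labels = clusterer.fit(values).labels_  # one label per series, -1 marks noise

# Count noise points and real clusters separately.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist()) - {-1})
print(labels, n_clusters, n_noise)
```

With more than one feature per timestamp, a purely numeric array of shape (n_samples, n_time_stamps, n_features) could be flattened the same way with `arr.reshape(arr.shape[0], -1)`, so that each (timestamp, feature) pair becomes one column.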

SkogensKonung