
I am trying to train an LSTM model to reconstruct time series data. I have a data set of ~1800 univariate time series. Basically I'm trying to solve a problem similar to this one: Anomaly detection in ECG plots, but my time series have different lengths.

I used this approach to deal with the variable lengths: How to apply LSTM-autoencoder to variant-length time-series data?, and this approach to split the input data by shape: Keras misinterprets training data shape.

When looping over the data and fitting a model for every shape, is the final model based only on the last shape it trained on, or does it use all the data to train the final model?

How would I train the model on all the input data regardless of shape? I know I could add padding, but I am trying to use the data as-is at this point. Any suggestions or other approaches for dealing with time series of different lengths? (It is not an issue of sampling rate; it is more that one time series started recording on day X and some only on day X+100.)

Here is the code I am using for my autoencoder:

import numpy as np

import keras.backend as K
from keras.models import Model
from keras.layers import (Input, Dense, TimeDistributed, LSTM, GRU, Dropout, merge,
                          Flatten, RepeatVector, Bidirectional, SimpleRNN, Lambda)


def encoder(model_input, layer, size, num_layers, drop_frac=0.0, output_size=None,
            bidirectional=False):
    """Encoder module of autoencoder architecture"""
    if output_size is None:
        output_size = size
    encode = model_input
    for i in range(num_layers):
        wrapper = Bidirectional if bidirectional else lambda x: x
        encode = wrapper(layer(size, name='encode_{}'.format(i),
                               return_sequences=(i < num_layers - 1)))(encode)
        if drop_frac > 0.0:
            encode = Dropout(drop_frac, name='drop_encode_{}'.format(i))(encode)
    encode = Dense(output_size, activation='linear', name='encoding')(encode)
    return encode


def repeat(x):
    stepMatrix = K.ones_like(x[0][:, :, :1])    # matrix of ones, shaped (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1], axis=1)  # latent vars, shaped (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix, latentMatrix)


def decoder(encode, layer, size, num_layers, drop_frac=0.0, aux_input=None,
            bidirectional=False):
    """Decoder module of autoencoder architecture"""
    # note: repeat() reads the global `inputs` tensor to recover the number of time steps
    decode = Lambda(repeat)([inputs, encode])
    if aux_input is not None:
        decode = merge([aux_input, decode], mode='concat')

    for i in range(num_layers):
        if drop_frac > 0.0 and i > 0:  # skip these for first layer for symmetry
            decode = Dropout(drop_frac, name='drop_decode_{}'.format(i))(decode)
        wrapper = Bidirectional if bidirectional else lambda x: x
        decode = wrapper(layer(size, name='decode_{}'.format(i),
                               return_sequences=True))(decode)

    decode = TimeDistributed(Dense(1, activation='linear'), name='time_dist')(decode)
    return decode


inputs = Input(shape=(None, 1))
encoded = encoder(inputs, LSTM, 128, 2, drop_frac=0.0, output_size=None, bidirectional=False)
decoded = decoder(encoded, LSTM, 128, 2, drop_frac=0.0, aux_input=None,
                  bidirectional=False)


sequence_autoencoder = Model(inputs, decoded)
sequence_autoencoder.compile(optimizer='adam', loss='mae')


trainByShape = {}
for item in train_data:
    if item.shape in trainByShape:
        trainByShape[item.shape].append(item)
    else:
        trainByShape[item.shape] = [item]

for shape in trainByShape:
    modelHistory = sequence_autoencoder.fit(
        np.asarray(trainByShape[shape]),
        np.asarray(trainByShape[shape]),
        epochs=100, batch_size=1, validation_split=0.15)
Ybg
  • @GoldenLion bidirectional is set to false, as I do not want to fill in any data. I want to take the time series as-is and train a model to reconstruct them; the only constraint is the different lengths of the time series in the training set. Segmenting the data is not an option at this point – Ybg Jan 28 '22 at 23:02
  • @GoldenLion, the data is from different jobs. Anomalies can happen pretty fast, so smoothing or resampling the data might lead to loss of information. (Just like in the ECG example I added: think of each time series as a different person's heart rate.) – Ybg Jan 31 '22 at 18:40

3 Answers


Use a bidirectional LSTM and increase the number of parameters to gain accuracy. I increased the latent_dim to 1000 and it fit the data closely, at the cost of more hardware and more memory.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM, Bidirectional

def create_dataset(dataset, look_back=3):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back)]
        dataX.append(a)
        dataY.append(dataset[i + look_back])
    return np.array(dataX), np.array(dataY)
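# Illustration with a hypothetical toy input, to show the sliding windows this builds:
# create_dataset(np.array([1, 2, 3, 4, 5, 6]), look_back=3)
# returns X = [[1, 2, 3], [2, 3, 4]] and y = [4, 5] (each window predicts the next value)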

COLUMNS=['Open']
dataset=eqix_df[COLUMNS]  # eqix_df: a pandas DataFrame of prices with an 'Open' column
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(np.array(dataset).reshape(-1,1))

train_size = int(len(dataset) * 0.70)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size], dataset[train_size:len(dataset)]

look_back=10
trainX=[]
testX=[]
y_train=[]

trainX, y_train = create_dataset(train, look_back)
testX, y_test = create_dataset(test, look_back)

X_train = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
X_test = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
latent_dim=700
n_future=1

model = Sequential()

# input_shape matches X_train: (timesteps=1, features=look_back)
model.add(Bidirectional(LSTM(units=latent_dim, return_sequences=True),
                        input_shape=(X_train.shape[1], X_train.shape[2])))

#LSTM 1
model.add(Bidirectional(LSTM(latent_dim,return_sequences=True,dropout=0.4,recurrent_dropout=0.4,name='lstm1'))) 

#LSTM 2 
model.add(Bidirectional(LSTM(latent_dim,return_sequences=True,dropout=0.2,recurrent_dropout=0.4,name='lstm2')))

#LSTM 3 
model.add(Bidirectional(LSTM(latent_dim, return_sequences=False,dropout=0.2,recurrent_dropout=0.4,name='lstm3')))

model.add(Dense(units = n_future))

model.compile(optimizer="adam", loss="mean_squared_error", metrics=["acc"])

history=model.fit(X_train, y_train,epochs=50,verbose=0)

plt.plot(history.history['loss'])
plt.title('training loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

#print(X_test)
prediction = model.predict(X_test)

# shift predictions for plotting
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(prediction)+look_back, :] = prediction
# shift test predictions for plotting
#plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot, color='red')
#plt.plot(testPredictPlot)
#plt.legend(['Actual','Train','Test'])
x=np.linspace(look_back,len(prediction)+look_back,len(y_test))
plt.plot(x,y_test)
plt.show()
Golden Lion
  • Thanks, I'll give it a try. Just to make sure I understand: you are batching the data into 10 data points per sample and fitting to data point 11? Are you not shuffling the data at any point? – Ybg Feb 02 '22 at 02:17
  • att1..att140 features and target. What does the target mean? – Golden Lion Feb 02 '22 at 15:09
  • att1...att140 are the time-series values (each row is a different time series). Target is the classification of the graph, i.e. normal (1) or one of the other anomaly types. Do you still need code to load the data? – Ybg Feb 02 '22 at 18:24
  • I am watching a video on how to plot the ECG. Do you have code for plotting? I used this code to load the data: from scipy.io import arff; import pandas as pd; data = arff.loadarff('ECG5000_TRAIN.arff'); df = pd.DataFrame(data[0]) – Golden Lion Feb 03 '22 at 15:10
  • I found https://www.kaggle.com/phrazore/ecg-time-series-anomaly-detection – Golden Lion Feb 03 '22 at 15:31
  • The data is partitioned into five classes: 'Normal', 'R on T', 'PVC', 'SP', 'UB' – Golden Lion Feb 03 '22 at 15:38
  • att1 through att140 are data points in the time series. Now I understand the flow – Golden Lion Feb 03 '22 at 15:41
  • normal=df.query("target==b'1'").drop(labels='target', axis=1).mean(axis=0).to_numpy() gave me a sequence of data points. Why do they average the time-series points? – Golden Lion Feb 03 '22 at 18:24
  • Normal has 140 columns and 292 rows of data. I average all the rows into 140 data points, then convert it to a numpy array – Golden Lion Feb 03 '22 at 18:38
  • I think I can use the LSTM network as a classifier of the 5 classes. I would feed the network a segment of data points and see if it correctly identifies the class – Golden Lion Feb 03 '22 at 18:41

The Keras LSTM implementation expects input of shape (Batch, Timesteps, Features).

One solution would be to set Timesteps = 1 and pass the sequence length as the Batch dimension, as sketched below.

If the sampling procedure is the same (no need for resampling), and the difference in length only comes from when the recording starts (day X+100 instead of day X), I would try to get rid of the lag in the pre-processing stage and keep only the section of interest.
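
A minimal sketch of that reshaping, assuming a single univariate NumPy series (the array name and length are placeholders, not from the question):

import numpy as np

series = np.random.rand(137)         # hypothetical univariate series of length 137

# Timesteps = 1: every time step becomes its own batch entry, shape (137, 1, 1)
x = series.reshape(-1, 1, 1)

# alternatively, keep the whole series as one variable-length sample, shape (1, 137, 1);
# this is what the question's Input(shape=(None, 1)) with batch_size=1 already allows
x_single = series.reshape(1, -1, 1)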

Yoan B. M.Sc
  • The extra data is still of interest, as it can still contain anomalies I want to detect – Ybg Feb 02 '22 at 02:21
  • @Ybg You know your problem better than I do, but if the time lag is "before recording" I would not expect this section to have any info. The proposed solution still works, though, even if you want to keep the part before recording. – Yoan B. M.Sc Feb 02 '22 at 13:44

Part 1: plotting the irregular heartbeats. Part 2: a Dense network that classifies incoming heartbeat voltages to predict irregular beat patterns, with 94% accuracy!

from scipy.io import arff
import pandas as pd
from scipy.misc import electrocardiogram
import matplotlib.pyplot as plt
import numpy as np
data = arff.loadarff('ECG5000_TRAIN.arff')
df = pd.DataFrame(data[0])

#for column in df.columns:
#    print(column)
    
columns=[x for x in df.columns if x!="target"]    
print(columns)

#print(df[df.target == "b'1'"].drop(labels='target', axis=1).mean(axis=0).to_numpy())
normal=df.query("target==b'1'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
rOnT=df.query("target==b'2'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
pcv=df.query("target==b'3'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
sp=df.query("target==b'4'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
ub=df.query("target==b'5'").drop(labels='target', axis=1).mean(axis=0).to_numpy()

plt.plot(normal,label="Normal")
plt.plot(rOnT,label="R on T",alpha=.3)
plt.plot(pcv, label="PCV",alpha=.3)
plt.plot(sp, label="SP",alpha=.3)
plt.plot(ub, label="UB",alpha=.3)
plt.legend()
plt.title("ECG")
plt.show()

Frame-by-frame comparison for normal. There are bands of operation which a normal heart stays within:

def PlotTheFrames(df, title, color):
    fig, ax = plt.subplots(figsize=(140, 50))
    for key, item in df.iterrows():
        array = np.array(item).flatten()
        x = np.linspace(0, 100, len(array))
        ax.plot(x, array, c=color)
    plt.title(title)
    plt.show()

normal = df.query("target==b'1'").drop(labels='target', axis=1)
PlotTheFrames(normal, "Normal Heart beat", 'r')

For R on T, the valves don't seem to be operating correctly:

rOnT=df.query("target==b'2'").drop(labels='target', axis=1)

PlotTheFrames(rOnT,"R on T Heart beat","b")   

Use a deep learning Dense network instead of an LSTM! I used LeakyReLU so that negative activations keep a small gradient.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

X = df[columns]
y = pd.get_dummies(df['target'])

model = Sequential()
model.add(Dense(440, input_shape=(len(columns),), activation='LeakyReLU'))
model.add(Dropout(0.4))
model.add(Dense(280, activation='LeakyReLU'))
model.add(Dropout(0.2))
model.add(Dense(240, activation='LeakyReLU'))
model.add(Dense(32, activation='LeakyReLU'))
model.add(Dense(16, activation='LeakyReLU'))
model.add(Dense(5))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

history = model.fit(X_train, y_train, epochs=1000, verbose=0)

model.evaluate(X_test, y_test)

plt.plot(history.history['loss'])
plt.title('training loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()
Golden Lion
  • The abnormal heart beat seems like an electro-chemical problem where there is a slight imbalance. – Golden Lion Feb 03 '22 at 21:29
  • The area of interest is att100:att140. I think you could drop the other columns; they don't contribute much to the signal – Golden Lion Feb 03 '22 at 21:32
  • The UB class looks like the heart is overstimulated, almost as if the timing is off: chemicals build up at the wrong interval, causing an electrical storm – Golden Lion Feb 03 '22 at 21:34
  • I am going to use a batch approach to training the network. You can see through the visualizations that segments of time in repeating frames are of interest. – Golden Lion Feb 03 '22 at 21:39