I have a dataset for multi-label binary (0/1) classification; some of the labels can never occur (the corresponding combination of row/column indices is impossible in my application), and these are denoted in the input with -1.
I don't want the network to learn weights associated with the -1 values, and I don't want the loss function to be affected by them. To that end, I'm using a Masking layer.
I'm trying to modify the accepted answer here to work in the multi-label case.
The input (X) is a sequence of 7 numpy arrays, each of shape (124, 124) with values in {0, 1, -1}.
The labels (y_true) are a single (124, 124) array with the same {0, 1, -1} values.
The accepted answer recommends one-hot encoding the binary values together with the masking value, and Keras LSTMs expect input of shape [num_samples, num_timesteps, num_features]. These two things mean my X and y_true shapes become (1, 7, 124*124*3) and (1, 124*124*3) below, respectively.
from keras.layers import Input, Masking, LSTM, Dense
from keras.models import Model
from keras.utils import to_categorical
from keras import backend as K
import numpy as np
#Creating some sample data
#Each timestep is a 124x124 matrix with values in {-1, 0, 1}
X = np.random.rand(1, 7, 124, 124).flatten()
X[X < 0.2] = 0.
X[X > 0.4] = -1.
X[(X > 0.2) & (X < 0.4)] = 1.
#One-hot encode the three categories; to_categorical wraps -1 around to the
#last class, so 0 -> [1,0,0], 1 -> [0,1,0], -1 -> [0,0,1]
X = to_categorical(X, num_classes=3)
X = np.reshape(X, (1, 7, 124*124*3)) #X needs shape (num_samples, num_timesteps, num_features)
Y = np.random.rand(124, 124).flatten()
Y[Y < 0.2] = 0.
Y[Y > 0.4] = -1.
Y[(Y > 0.2) & (Y < 0.4)] = 1.
Y = to_categorical(Y, num_classes=3)
y_true = np.reshape(Y, (1, 124*124*3)) #predicting a single timestep
#Building the model
#A timestep is masked only when its entire feature vector equals mask_val,
#i.e. when every cell of the frame is the one-hot encoding of -1
mask_val = np.tile([0., 0., 1.], 124*124)
input_tensor = Input(shape=(7, 124*124*3))
masking_input = Masking(mask_value=mask_val)(input_tensor)
lstm = LSTM(2)(masking_input) #return_sequences=False: one output for the whole sequence
output = Dense(124*124*3, activation='sigmoid')(lstm)
model = Model(input_tensor, output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.summary())
y_pred = model.predict(X)
#See if the model's loss is the same as the unmasked loss (shouldn't be)
print(model.evaluate(X, y_true))
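To make that check concrete, here is a minimal sketch (my own addition, assuming the shapes above) that computes the unmasked loss by hand with the Keras backend, so it can be compared against the model.evaluate output:
#Unmasked categorical crossentropy, computed by hand; if the Masking layer
#is doing its job, model.evaluate above should report a different number
unmasked_loss = K.eval(K.mean(
    K.categorical_crossentropy(K.constant(y_true), K.constant(y_pred))))
print('unmasked loss:', unmasked_loss)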
Right now, I don't know if I'm taking the right approach. Will this mask the input properly, so the network doesn't bother learning weights that map -1 to -1?
Also, is passing X directly to model.evaluate the right way to feed in the entire length-7 input sequence?
EDIT
Looking again through this discussion, masking does not do what I thought it did.
First, the reason for the one-hot encoding above is that masking compares entire per-timestep feature vectors rather than individual values, so masking each occurrence of -1 will not work (masking a timestep whose feature vector is entirely composed of the mask value will work, but that is not what I want).
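Here is a toy check of that behavior (my own sketch, not from the linked discussion): Masking drops a timestep only when every feature matches the mask value.
import numpy as np
from keras.layers import Input, Masking
from keras import backend as K

inp = Input(shape=(3, 2)) #3 timesteps, 2 features
mask_layer = Masking(mask_value=-1.)
masked = mask_layer(inp)
mask = mask_layer.compute_mask(inp) #boolean mask per timestep
f = K.function([inp], [mask])
x = np.array([[[-1., -1.],   #entirely -1 -> masked
               [-1.,  0.],   #partly -1   -> kept
               [ 0.,  1.]]]) #normal      -> kept
print(f([x])[0]) #expected: [[False  True  True]]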
Second, the intended use case of a masking layer is when a timestep is missing from a sequence. For example, if one sentence has 10 words and another has 12, it is common practice to zero-pad the inputs so both sentences have the same length (i.e. tack two zeros onto the first one). Masking then detects the zero-padded timesteps and (presumably) skips their contribution to the weight updates.
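For illustration, a quick sketch of that padding workflow (my own example, not from the discussion):
from keras.preprocessing.sequence import pad_sequences

#Two "sentences" of word ids, lengths 10 and 12
seqs = [[3, 7, 2, 9, 1, 4, 6, 8, 5, 2],
        [3, 7, 2, 9, 1, 4, 6, 8, 5, 2, 7, 1]]
padded = pad_sequences(seqs, maxlen=12, padding='post') #appends two zeros to the first
#A downstream Masking(mask_value=0.) or Embedding(mask_zero=True) would then
#skip those padded timesteps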
Conclusion: I need a way to eliminate input values that are NOT time-dependent. Could I do this with a fixed Dropout layer (sketched below)? Or should I pretend that I have two sequences: a true time sequence (1-7) and a subsequence that could have occurrences of [0, 0, 1]?
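Something like this is what I have in mind for the fixed-Dropout option: multiplying the input by a constant 0/1 mask via a Lambda layer. This is hypothetical and untested, and valid_mask is a placeholder (in reality it would be 1 for possible row/column positions and 0 for impossible ones):
from keras.layers import Input, Lambda
from keras import backend as K
import numpy as np

#Placeholder mask: in reality, 1. for possible positions, 0. for impossible ones
valid_mask = (np.random.rand(124*124*3) > 0.1).astype('float32')
const = K.constant(valid_mask)
inp = Input(shape=(7, 124*124*3))
zeroed = Lambda(lambda t: t * const)(inp) #impossible positions forced to 0 at every timestep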