
I have a dataset for multi-label binary (0/1) classification; some of the labels will never exist (the combination of row/column indices is impossible in my application), and this has been denoted in the input with -1.

I don't want the network to learn weights associated with the -1 values, and I don't want the loss function to be affected by them. To prevent this, I'm using a Masking layer.

I'm trying to modify the accepted answer here to work in the multi-label case.

The dataset (X) consists of numpy arrays of size (124, 124) with (0/1/-1) values. There is a sequence of 7 such numpy arrays.

The labels (y_true) are (0/1/-1) in another (124, 124) array.

The accepted answer recommends one-hot encoding the binary values together with the masking value, and Keras LSTMs expect input of shape [num_samples, num_timesteps, num_features]. Together, these mean my X and y_true shapes become (1, 7, 124*124*3) and (1, 124*124*3) below, respectively.
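As a minimal numpy sketch of the encoding step (using np.eye indexing as a stand-in for to_categorical, which likewise sends the integer -1 to the last class index):

```python
import numpy as np

# Values -1/0/1, as in the dataset; -1 marks impossible cells.
vals = np.array([[0, 1, -1],
                 [1, -1, 0]])

# Integer indexing with -1 selects the last row of the identity matrix,
# so -1 -> [0, 0, 1], 0 -> [1, 0, 0], 1 -> [0, 1, 0].
one_hot = np.eye(3)[vals]

print(one_hot.shape)  # (2, 3, 3): the original shape plus a length-3 class axis
```

So after one-hot encoding, every cell grows a length-3 class axis, which is where the extra factor of 3 in the shapes above comes from.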

from keras.layers import Input, Masking, LSTM, Dense
from keras.models import Model
from keras.utils import to_categorical
import numpy as np

#Creating some sample data

#Matrix has size 124*124, values -1, 0, 1
X = np.random.rand(1, 7, 124, 124).flatten()

X[X < 0.2] = 0.
X[X > 0.4] = -1.
X[(X > 0.2) & (X < 0.4)] = 1.

#Categories are 0, 1, -1 one-hot encoded
X = to_categorical(X, num_classes=3)
X = np.reshape(X, (7, 3, 124*124)) #X needs to be shape (num_samples, num_timesteps, num_features)

Y = np.random.rand(124, 124).flatten()
Y[Y < 0.2] = 0.
Y[Y > 0.4] = -1.
Y[(Y > 0.2) & (Y < 0.4)] = 1.
Y = to_categorical(Y, num_classes=3)
y_true = np.reshape(Y, (1, 3, 124*124)) #predicting a single timestep

#Building the model

mask_val = np.tile([0,0,1], 124*124).reshape((3, 124*124))
input_tensor = Input(shape=(3, 124*124))
masking_input = Masking(mask_value=mask_val)(input_tensor)
lstm = LSTM(2, return_sequences=True)(masking_input)
output = Dense(124*124, activation='sigmoid')(lstm)

model = Model(input_tensor, output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

y_pred = model.predict(X)

#See if the model's loss is the same as the unmasked loss (shouldn't be)
print(model.evaluate(np.expand_dims(X[0,:,:], axis=0), y_true))

Right now, I don't know if I'm taking the right approach. Will this mask the input properly, so the network doesn't bother learning weights that map the -1 inputs to -1 outputs?

Also, how do I feed in the entire length-7 input sequence to model.evaluate?

EDIT

Looking again through this discussion, masking does not do what I thought it did.

First, the reason for the one-hot encoding above is that masking doesn't operate on individual scalar values: it works per timestep on the whole feature vector, so masking individual occurrences of -1 will not work (masking a timestep whose feature vector is entirely -1 will work, but is not what I want).
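To illustrate the per-timestep behavior: Keras's Masking layer flags a timestep as masked only when every feature at that timestep equals mask_value (internally it checks any(not_equal(inputs, mask_value)) over the feature axis). A numpy sketch of that check, assuming a scalar mask_value of -1:

```python
import numpy as np

x = np.array([[[-1., -1., -1.],     # timestep 0: ALL features are -1 -> masked
               [-1.,  0.,  1.],     # timestep 1: only one feature is -1 -> kept
               [ 0.,  1.,  0.]]])   # timestep 2: no -1 at all -> kept

# Mirrors Masking(mask_value=-1.): a timestep survives if ANY feature differs.
mask = np.any(x != -1., axis=-1)
print(mask)  # [[False  True  True]]
```

The isolated -1 in timestep 1 is not masked, which is exactly the problem described above.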

Second, the intended use case of a Masking layer is a missing time step in a sequence. For example, if one sentence has a sequence of 10 words and another has 12, it is common practice to zero-pad the shorter input so both sentences have the same length (i.e. append two all-zero timesteps to the first sentence). Masking then detects the zero-padded timesteps and (presumably) skips the contribution those timesteps would have on updating the weights.
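That padding use case can be sketched in numpy the same way, assuming mask_value=0 and a toy length-4 sequence of 3-dimensional feature vectors:

```python
import numpy as np

# Two real 3-feature timesteps, zero-padded to length 4 for batching.
seq = np.array([[0.5, 0.1, 0.9],
                [0.3, 0.7, 0.2],
                [0.0, 0.0, 0.0],    # padding
                [0.0, 0.0, 0.0]])   # padding

# What Masking(mask_value=0.) would flag: only the padded steps are dropped.
mask = np.any(seq != 0., axis=-1)
print(mask)  # [ True  True False False]
```

Here the mask is genuinely time-dependent (whole timesteps are missing), unlike the fixed per-cell -1 pattern in my data.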

Conclusion: I need a way to eliminate input values that are NOT time-dependent. Could I do this with a fixed Dropout layer? Or should I pretend that I have two sequences: a true time sequence (1 - 7) and a subsequence that could have occurrences of [0, 0, 1]?
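For what a "fixed mask" might mean in practice (a sketch of one possible direction, not something from the linked answer): since the impossible cells are known ahead of time, a fixed 0/1 validity mask could zero out their contribution to the loss, e.g. a binary cross-entropy averaged only over valid cells:

```python
import numpy as np

y_true = np.array([ 1.,  0., -1.,  1.])   # -1 marks an impossible cell
y_pred = np.array([0.9, 0.2, 0.5, 0.7])

valid = (y_true != -1.).astype(float)     # fixed 0/1 mask, known in advance

# Binary cross-entropy summed over valid cells only; masked cells contribute 0.
eps = 1e-7                                # avoid log(0)
bce = -(y_true * np.log(y_pred + eps) + (1. - y_true) * np.log(1. - y_pred + eps))
loss = np.sum(bce * valid) / np.sum(valid)
print(round(loss, 4))  # 0.2284
```

The division by np.sum(valid) keeps the loss scale comparable regardless of how many cells are masked.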
