
I'm a newbie in Keras and I'm trying to solve the task of sentence similarity using a neural network in Keras. I use word2vec as the word embedding, and then a Siamese network to predict how similar two sentences are. The base network of the Siamese network is an LSTM, and to merge the two base networks I use a Lambda layer with a cosine similarity metric. As dataset I'm using the SICK dataset, which gives a score to each pair of sentences, from 1 (different) to 5 (very similar).

I created the network and it runs, but I have a lot of doubts. First of all, I'm not sure the way I feed the LSTM with sentences is fine. I take the word2vec embedding of each word and build only one flat array per sentence, padding it with zeros so that all the arrays have the same length. Then I reshape it this way: data_A = embedding_A.reshape((len(embedding_A), seq_len, feature_dim))

Besides, I'm not sure my Siamese network is correct, because a lot of the predictions for different pairs are equal and the loss doesn't change much (from 0.3300 to 0.2105 in 10 epochs, and it doesn't change much more in 100 epochs).

Can someone help me find and understand my mistakes? Thanks so much (and sorry for my bad English).

The relevant part of my code:

import os
import numpy as np
import pandas as pd
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import LSTM, Dense, Input, Lambda
from keras.optimizers import Adam

def cosine_distance(vecs):
    #I'm not sure about this function either
    y_true, y_pred = vecs
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return K.mean(1 - K.sum((y_true * y_pred), axis=-1))

def cosine_dist_output_shape(shapes):
    shape1, shape2 = shapes
    print((shape1[0], 1))
    return (shape1[0], 1)

def contrastive_loss(y_true, y_pred):
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))

def create_base_network(feature_dim,seq_len):

    model = Sequential()  
    model.add(LSTM(100, batch_input_shape=(1,seq_len,feature_dim),return_sequences=True))
    model.add(Dense(50, activation='relu'))    
    model.add(Dense(10, activation='relu'))
    return model


def siamese(feature_dim,seq_len, epochs, tr_dataA, tr_dataB, tr_y, te_dataA, te_dataB, te_y):    

    base_network = create_base_network(feature_dim,seq_len)

    input_a = Input(shape=(seq_len,feature_dim,))
    input_b = Input(shape=(seq_len,feature_dim))

    processed_a = base_network(input_a)
    processed_b = base_network(input_b)

    distance = Lambda(cosine_distance, output_shape=cosine_dist_output_shape)([processed_a, processed_b])

    model = Model([input_a, input_b], distance)

    adam = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
    model.compile(optimizer=adam, loss=contrastive_loss)
    model.fit([tr_dataA, tr_dataB], tr_y,
              batch_size=128,
              epochs=epochs,
              validation_data=([te_dataA, te_dataB], te_y))


    pred = model.predict([tr_dataA, tr_dataB])
    tr_acc = compute_accuracy(pred, tr_y)
    for i in range(len(pred)):
        print (pred[i], tr_y[i])


    return model


def padding(max_len, embedding):
    for i in range(len(embedding)):
        padding = np.zeros(max_len-embedding[i].shape[0])
        embedding[i] = np.concatenate((embedding[i], padding))

    embedding = np.array(embedding)
    return embedding

def getAB(sentences_A,sentences_B, feature_dim, word2idx, idx2word, weights,max_len_def=0):
    #from_sentence_to_array: function that transforms natural language sentences
    #into vectors of real numbers. Each word is replaced with the corresponding word2vec
    #embedding, and words that aren't in the embedding are replaced with a zero vector.
    embedding_A, max_len_A = from_sentence_to_array(sentences_A,word2idx, idx2word, weights)
    embedding_B, max_len_B = from_sentence_to_array(sentences_B,word2idx, idx2word, weights)

    max_len = max(max_len_A, max_len_B,max_len_def*feature_dim)

    #padding to max_len
    embedding_A = padding(max_len, embedding_A)
    embedding_B = padding(max_len, embedding_B)

    seq_len = int(max_len/feature_dim)
    print(seq_len)

    #reshape
    data_A = embedding_A.reshape((len(embedding_A), seq_len, feature_dim))
    data_B = embedding_B.reshape((len(embedding_B), seq_len, feature_dim))

    print('A,B shape: ',data_A.shape, data_B.shape)

    return data_A, data_B, seq_len



FEATURE_DIMENSION = 100
MIN_COUNT = 10
WINDOW = 5

if __name__ == '__main__':

    data = pd.read_csv('data\\train.csv', sep='\t')
    sentences_A = data['sentence_A']
    sentences_B = data['sentence_B']
    tr_y = 1- data['relatedness_score']/5

    if not (os.path.exists(EMBEDDING_PATH)  and os.path.exists(VOCAB_PATH)):    
        create_embeddings(embeddings_path=EMBEDDING_PATH, vocab_path=VOCAB_PATH,  size=FEATURE_DIMENSION, min_count=MIN_COUNT, window=WINDOW, sg=1, iter=25)
    word2idx, idx2word, weights = load_vocab_and_weights(VOCAB_PATH,EMBEDDING_PATH)

    tr_dataA, tr_dataB, seq_len = getAB(sentences_A,sentences_B, FEATURE_DIMENSION,word2idx, idx2word, weights)

    test = pd.read_csv('data\\test.csv', sep='\t')
    test_sentences_A = test['sentence_A']
    test_sentences_B = test['sentence_B']
    te_y = 1- test['relatedness_score']/5

    te_dataA, te_dataB, seq_len = getAB(test_sentences_A,test_sentences_B, FEATURE_DIMENSION,word2idx, idx2word, weights, seq_len) 

    model = siamese(FEATURE_DIMENSION, seq_len, 10, tr_dataA, tr_dataB, tr_y, te_dataA, te_dataB, te_y)


    test_a = ['this is my dog']
    test_b = ['this dog is mine']
    a,b,seq_len = getAB(test_a,test_b, FEATURE_DIMENSION,word2idx, idx2word, weights, seq_len)
    prediction  = model.predict([a, b])
    print(prediction)

Some of the results:

my prediction | true label 
0.849908 0.8
0.849908 0.8
0.849908 0.74
0.849908 0.76
0.849908 0.66
0.849908 0.72
0.849908 0.64
0.849908 0.8
0.849908 0.78
0.849908 0.8
0.849908 0.8
0.849908 0.8
0.849908 0.8
0.849908 0.74
0.849908 0.8
0.849908 0.8
0.849908 0.8
0.849908 0.66
0.849908 0.8
0.849908 0.66
0.849908 0.56
0.849908 0.8
0.849908 0.8
0.849908 0.76
0.847546 0.78
0.847546 0.8
0.847546 0.74
0.847546 0.76
0.847546 0.72
0.847546 0.8
0.847546 0.78
0.847546 0.8
0.847546 0.72
0.847546 0.8
0.847546 0.8
0.847546 0.78
0.847546 0.8
0.847546 0.78
0.847546 0.78
0.847546 0.46
0.847546 0.72
0.847546 0.8
0.847546 0.76
0.847546 0.8
0.847546 0.8
0.847546 0.8
0.847546 0.8
0.847546 0.74
0.847546 0.8
0.847546 0.72
0.847546 0.68
0.847546 0.56
0.847546 0.8
0.847546 0.78
0.847546 0.78
0.847546 0.8
0.852975 0.64
0.852975 0.78
0.852975 0.8
0.852975 0.8
0.852975 0.44
0.852975 0.72
0.852975 0.8
0.852975 0.8
0.852975 0.76
0.852975 0.8
0.852975 0.8
0.852975 0.8
0.852975 0.78
0.852975 0.8
0.852975 0.8
0.852975 0.78
0.852975 0.8
0.852975 0.8
0.852975 0.76
0.852975 0.8
MiVe93

2 Answers


You're seeing consecutive equal values because the output shape of the function cosine_distance is wrong. When you take K.mean(...) without the axis argument, the result is a scalar. To fix it, just use K.mean(..., axis=-1) in cosine_distance instead of K.mean(...).
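
For reference, a minimal sketch of cosine_distance with that change (I also add keepdims=True so the result matches the (batch_size, 1) shape declared in cosine_dist_output_shape):

def cosine_distance(vecs):
    # Average over the last axis only, so the Lambda layer returns one
    # distance per sample instead of a single scalar for the whole batch.
    y_true, y_pred = vecs
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return K.mean(1 - K.sum(y_true * y_pred, axis=-1), axis=-1, keepdims=True)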

More Detailed Explanation:

When model.predict() is called, the output array pred is first pre-allocated, and then filled with the batch predictions. From the source code training.py:

if batch_index == 0:
    # Pre-allocate the results arrays.
    for batch_out in batch_outs:
        shape = (num_samples,) + batch_out.shape[1:]
        outs.append(np.zeros(shape, dtype=batch_out.dtype))
for i, batch_out in enumerate(batch_outs):
    outs[i][batch_start:batch_end] = batch_out

In your case you only have a single output, so pred is just outs[0] in the code above. When batch_out is a scalar (for example, 0.847546 as seen in your results), the code above is equivalent to pred[batch_start:batch_end] = 0.847546. As the default batch size is 32 for model.predict(), you can see 32 consecutive 0.847546 values appear in your posted result.
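
Here is a tiny NumPy illustration of that filling behaviour, using two of the scalar values from your output:

import numpy as np

num_samples = 64
pred = np.zeros((num_samples,))
pred[0:32] = 0.847546    # first batch: one scalar broadcast over the whole slice
pred[32:64] = 0.852975   # second batch: same story
print(pred[:3], pred[33:36])  # two blocks of identical "predictions"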


Another possibly bigger problem is that the labels are wrong. You convert the relatedness score to labels by tr_y = 1- data['relatedness_score']/5. Now if two sentences are "very similar", the relatedness score is 5, so tr_y is 0 for these two sentences.

However, in the contrastive loss, when y_true is zero, the term K.maximum(margin - y_pred, 0) actually means that "these two sentences should have a cosine distance >= margin". That's the opposite of what you want your model to learn (also I don't think you need K.square in the loss).
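
An alternative, if you would rather keep contrastive_loss exactly as written, is to flip the label mapping instead, so that similar pairs get y close to 1 and dissimilar pairs get y close to 0. A minimal sketch of that option:

tr_y = (data['relatedness_score'] - 1) / 4   # 5 (very similar) -> 1, 1 (different) -> 0
te_y = (test['relatedness_score'] - 1) / 4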

Yu-Yang
  • Thank you so much for your help. I changed my cosine function and it worked :) But I still don't understand why my labels are wrong. In LeCun's paper ([link](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf)) about contrastive loss, it is written "Let Y be a binary label assigned to this pair. Y = 0 if X1 and X2 are deemed similar, and Y = 1 if they are deemed dissimilar", and this is why I used those labels. Am I wrong? – MiVe93 Oct 05 '17 at 10:33
  • You can compare Eq. 4 with your `contrastive_loss` function. If you want Y = 0 to denote similar pairs as in the paper, you need to swap the positions of `y_true` and `(1 - y_true)` in `contrastive_loss`. – Yu-Yang Oct 05 '17 at 11:39
  • Of course, you're right, now I get it! Thank you for your help and patience – MiVe93 Oct 05 '17 at 12:49

Just to have this captured in an answer somewhere (I see it in the comments of the accepted answer), your contrastive loss function should be:

loss = K.mean((1 - y) * K.square(d) + y * K.square(K.maximum(margin - d, 0)))

Your (1 - y) * ... and y * ... terms were swapped, which might throw off people who use your example as a starting point. It is otherwise an excellent starting point.

A note on nomenclature: you used y_true and y_pred instead of y and d. I use y and d because y is your label, which should be either 0 or 1, while d is not necessarily in that range (d is actually between 0 and 2 for cosine distance); it is not really a prediction of the value of y. You just want to minimize your distance measure d when two inputs are similar, and maximize it (or push it outside of your margin) when they are different. Basically, contrastive loss is not trying to get d to predict y; it is just trying to get d to be small when the inputs are the same and large when they are different.
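
Putting that together, a minimal sketch of the corrected loss in Keras (the function still receives its arguments positionally as (y_true, y_pred); only the local names change):

def contrastive_loss(y, d):
    # y = 0 for similar pairs, y = 1 for dissimilar pairs (the paper's convention).
    # Similar pairs are pulled together (small d); dissimilar pairs are pushed
    # at least `margin` apart.
    margin = 1
    return K.mean((1 - y) * K.square(d) + y * K.square(K.maximum(margin - d, 0)))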

Engineero