
I am quite new to machine learning and I was playing around with adversarial examples. I am trying to fool a binary character-level LSTM text classifier, so I need the gradient of the loss w.r.t. the input.

However, the gradients function returns None.

I already tried to get the gradients as described in this post and this post, but the gradients function still returns None.

EDIT: I wanted to do something similar to what is done in this git repo.

I was thinking the problem might be that it is an LSTM classifier, but I am not sure at this point. It should still be possible to get these gradients from an LSTM classifier, right?

Here is my code:

import numpy as np
from keras.preprocessing import sequence
from keras.models import load_model
import data
import pickle
import keras.backend as K

def adversary():
    # loadModel() and prepare_data() are helper functions from my own modules
    model, valid_chars = loadModel()
    model.summary()

    # load data
    X, y, maxlen, _, max_features, indata = prepare_data(valid_chars)

    target = y[0]

    # Get the loss and gradient of the loss wrt the inputs  
    target = np.asarray(target).astype('float32').reshape((-1,1))
    loss = K.binary_crossentropy(target, model.output)
    print(target)
    print(model.output)
    print(model.input)
    print(loss)
    grads = K.gradients(loss, model.input)

    #f = K.function([model.input], [loss, grads])

    #print(f(X[1:2]))
    print(model.predict(X[0:1]))

    print(grads)

The output looks like this:

Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 74, 128)           5120      
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
=================================================================
Total params: 136,833
Trainable params: 136,833
Non-trainable params: 0
_________________________________________________________________
Maxlen: 74
Data preparing finished
[[0.]]
Tensor("activation_1/Sigmoid:0", shape=(?, 1), dtype=float32)
Tensor("embedding_1_input:0", shape=(?, 74), dtype=float32)
Tensor("logistic_loss_1:0", shape=(?, 1), dtype=float32)
[[1.1397913e-13]]
[None]

I was hoping to get the gradient of the loss w.r.t. the input data to see which characters have the most impact on the output, so that I could fool the classifier by modifying those characters. Is this possible? If so, what is wrong with my approach?

Thank you for your time.

MMikkk

2 Answers


Gradients can only be computed for "trainable" tensors, so you might want to wrap your input in tf.Variable().
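
As a toy illustration of that idea (with a made-up continuous input, not the question's integer indices), something like this returns a gradient with respect to the input:

import tensorflow as tf

# Minimal sketch: wrap the input in a tf.Variable so the tape tracks it.
x = tf.Variable([[1.0, 2.0, 3.0]])            # the "input" we want gradients for
w = tf.constant([[0.5], [0.5], [0.5]])

with tf.GradientTape() as tape:
    y = tf.matmul(x, w)                       # simple forward pass
    loss = tf.reduce_sum(tf.square(y - 1.0))

print(tape.gradient(loss, x))                 # d(loss)/d(x), shape (1, 3)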

Since you want to work with gradients, I would suggest using TensorFlow directly, which integrates nicely with Keras. Below is my example; note that it works in eager execution mode (the default in TensorFlow 2.0).

def train_actor(self, sars):
    obs1, actions, rewards, obs2 = sars

    # record the forward pass so gradients can be taken afterwards
    with tf.GradientTape() as tape:
        would_do_actions = self.compute_actions(obs1)
        score = tf.reduce_mean(self.critic(observations=obs1, actions=would_do_actions))
        loss = -score

    # gradients of the loss w.r.t. the actor's weights, then one optimizer step
    grads = tape.gradient(loss, self.actor.trainable_weights)
    self.optimizer.apply_gradients(zip(grads, self.actor.trainable_weights))
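
If it helps to connect this to the question's setup: the integer character indices themselves have no gradient, so one possible adaptation (a sketch only, reusing model and X from the question and assuming a tf.keras model in eager mode) is to watch the embedding layer's output with the tape instead of the raw input:

import tensorflow as tf

x = tf.constant(X[0:1])                        # one encoded sample, shape (1, 74)
y_true = tf.constant([[0.0]])

with tf.GradientTape() as tape:
    emb = model.layers[0](x)                   # embedding lookup, (1, 74, 128)
    tape.watch(emb)                            # emb is a plain tensor, so watch it
    h = emb
    for layer in model.layers[1:]:             # LSTM, Dropout, Dense, Activation
        h = layer(h)
    loss = tf.keras.losses.binary_crossentropy(y_true, h)

grads = tape.gradient(loss, emb)               # gradient w.r.t. the embeddings
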
mari.mts
  • Thanks for the answer. I am not very familiar with tensorflow though. I unfortunately cannot see how I can integrate your provided code into my program. Could you provide me a good resource where I could read up on that topic? – MMikkk May 19 '19 at 15:04
  • Official documentation is a good start: https://www.tensorflow.org/api_docs/python/tf/gradients, https://www.tensorflow.org/api_docs/python/tf/train/Optimizer – mari.mts May 19 '19 at 17:01

I just found this thread. The gradients function returns None because the embedding layer is not differentiable.

The embedding layer is implemented with K.gather, which has no gradient with respect to the integer input indices, so no gradient can flow back to model.input.
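
For the goal in the question (finding which characters matter most), a common workaround along these lines is to take the gradient with respect to the embedding layer's output, which is differentiable, instead of the integer input. A rough sketch, reusing model, target and X from the adversary() function above:

import numpy as np
import keras.backend as K

emb_out = model.layers[0].output                     # embedding output, shape (None, 74, 128)
loss = K.binary_crossentropy(target, model.output)
grads = K.gradients(loss, emb_out)[0]                # no longer None
grad_fn = K.function([model.input], [grads])

g = grad_fn([X[0:1]])[0]                             # shape (1, 74, 128)
char_scores = np.linalg.norm(g, axis=-1)             # influence of each character position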

MMikkk