I am quite new to machine learning and I have been experimenting with adversarial examples. I am trying to fool a binary character-level LSTM text classifier, so I need the gradient of the loss w.r.t. the input.
However, the K.gradients function returns None. I already tried to get the gradients as in this post or this post, but K.gradients still returns None.
EDIT: I wanted to do something similar to what is done in this git repo.
I was thinking the problem might be that it is an LSTM classifier, but I am not sure. It should still be possible to get these gradients from an LSTM classifier, right?
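For reference, here is a minimal, self-contained toy sketch of what I would expect to work (the model, shapes, and vocabulary size are made up here, not my real classifier). My understanding is that the LSTM itself is differentiable, so taking gradients w.r.t. a float-valued tensor such as the embedding layer's output should yield a real tensor:

import numpy as np
import keras.backend as K
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Toy stand-in for my classifier: char indices -> embedding -> LSTM -> sigmoid
toy = Sequential([
    Embedding(input_dim=40, output_dim=8, input_length=10),
    LSTM(16),
    Dense(1, activation='sigmoid'),
])

target = K.placeholder(shape=(None, 1))
loss = K.binary_crossentropy(target, toy.output)

# Differentiate w.r.t. the (float-valued) embedding output
grads = K.gradients(loss, toy.layers[0].output)[0]
get_grads = K.function([toy.input, target], [grads])

x = np.random.randint(0, 40, size=(1, 10))
g = get_grads([x, np.array([[0.]])])[0]
print(g.shape)  # (1, 10, 8)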
Here is my code:
import numpy as np
from keras.preprocessing import sequence
from keras.models import load_model
import data
import pickle
import keras.backend as K

def adversary():
    # loadModel() and prepare_data() are helper functions defined elsewhere in my project
    model, valid_chars = loadModel()
    model.summary()

    # Load data
    X, y, maxlen, _, max_features, indata = prepare_data(valid_chars)
    target = y[0]

    # Get the loss and the gradient of the loss w.r.t. the inputs
    target = np.asarray(target).astype('float32').reshape((-1, 1))
    loss = K.binary_crossentropy(target, model.output)
    print(target)
    print(model.output)
    print(model.input)
    print(loss)
    grads = K.gradients(loss, model.input)
    #f = K.function([model.input], [loss, grads])
    #print(f([X[1:2]]))  # K.function expects its inputs wrapped in a list
    print(model.predict(X[0:1]))
    print(grads)
The output looks like this:
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 74, 128)           5120
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0
=================================================================
Total params: 136,833
Trainable params: 136,833
Non-trainable params: 0
_________________________________________________________________
Maxlen: 74
Data preparing finished
[[0.]]
Tensor("activation_1/Sigmoid:0", shape=(?, 1), dtype=float32)
Tensor("embedding_1_input:0", shape=(?, 74), dtype=float32)
Tensor("logistic_loss_1:0", shape=(?, 1), dtype=float32)
[[1.1397913e-13]]
[None]
I was hoping to get the gradients of the loss w.r.t. the input data to see which characters have the most impact on the output, so that I could fool the classifier by modifying those characters. Is this possible? If so, what is wrong with my approach?
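For context, once I have usable gradients, my rough plan is the following (an untested sketch; g here is a random stand-in for the real gradient array of shape (1, maxlen, embed_dim)):

import numpy as np

# Stand-in for the gradient of the loss w.r.t. the embedding output,
# shape (1, maxlen, embed_dim); in practice this would come from K.function
g = np.random.randn(1, 74, 128)

# Collapse the embedding dimension into one saliency score per character position
saliency = np.linalg.norm(g[0], axis=-1)   # shape (74,)
positions = np.argsort(saliency)[::-1]     # character positions, most influential first
print(positions[:5])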
Thank you for your time.