
I have a relatively simple block of code running in a Jupyter notebook (because I cannot for the life of me get it working in Google Colab), but I'm getting the error below when fitting the model. The linked post shows more detail about the inputs I'm working with.

I think it may be a GPU-related error. I've followed several similarly titled posts and CUDA/Visual Studio installation tutorials, to no avail.

import gensim
import os
import numpy
from gensim.models import KeyedVectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath('glove.6B.50d.txt')

word2vec_glove_file = get_tmpfile("glove.6B.50d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)

model = KeyedVectors.load_word2vec_format(word2vec_glove_file, binary=False)

# get the vector for each word in the glove file

emb_dict = {}
glove = open(glove_file, encoding="utf8")
for line in glove:
    values = line.split()
    word = values[0]
    vector = numpy.asarray(values[1:], dtype='float32')
    emb_dict[word] = vector
glove.close()

# get the top 10k words in the tweets, and find their vector from the glove files

num_words = 10000
glove_dims = 50

emb_matrix = numpy.zeros((num_words, glove_dims))
for w, i in token.word_index.items():
    if i < num_words:
        vect = emb_dict.get(w)
        if vect is not None:
            emb_matrix[i] = vect
    else:
        break

from keras import models, layers
from keras.layers import Embedding, Flatten, Dense

glove_model = models.Sequential()
glove_model.add(Embedding(num_words, glove_dims, input_length=max_tweet_len))
glove_model.add(layers.LSTM(100))
#glove_model.add(Flatten())
glove_model.add(Dense(3, activation='softmax'))

glove_model.layers[0].set_weights([emb_matrix])
glove_model.layers[0].trainable = False

glove_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = glove_model.fit(train_seq_x, y_train, epochs=10, batch_size=64)

The error returned is:

Epoch 1/10
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-11-224ee1a00f0e> in <module>
     13 glove_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
     14 
---> 15 history = glove_model.fit(train_seq_x, y_train, epochs=10, batch_size=64)

~\anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
   1237                                         steps_per_epoch=steps_per_epoch,
   1238                                         validation_steps=validation_steps,
-> 1239                                         validation_freq=validation_freq)
   1240 
   1241     def evaluate(self,

~\anaconda3\lib\site-packages\keras\engine\training_arrays.py in fit_loop(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)
    194                     ins_batch[i] = ins_batch[i].toarray()
    195 
--> 196                 outs = fit_function(ins_batch)
    197                 outs = to_list(outs)
    198                 for l, o in zip(out_labels, outs):

~\anaconda3\lib\site-packages\tensorflow\python\keras\backend.py in __call__(self, inputs)
   3790         value = math_ops.cast(value, tensor.dtype)
   3791       converted_inputs.append(value)
-> 3792     outputs = self._graph_fn(*converted_inputs)
   3793 
   3794     # EagerTensor.numpy() will often make a copy to ensure memory safety.

~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in __call__(self, *args, **kwargs)
   1603       TypeError: For invalid positional/keyword argument combinations.
   1604     """
-> 1605     return self._call_impl(args, kwargs)
   1606 
   1607   def _call_impl(self, args, kwargs, cancellation_manager=None):

~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in _call_impl(self, args, kwargs, cancellation_manager)
   1643       raise TypeError("Keyword arguments {} unknown. Expected {}.".format(
   1644           list(kwargs.keys()), list(self._arg_keywords)))
-> 1645     return self._call_flat(args, self.captured_inputs, cancellation_manager)
   1646 
   1647   def _filtered_call(self, args, kwargs):

~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1744       # No tape is watching; skip to running the function.
   1745       return self._build_call_outputs(self._inference_function.call(
-> 1746           ctx, args, cancellation_manager=cancellation_manager))
   1747     forward_backward = self._select_forward_and_backward_functions(
   1748         args,

~\anaconda3\lib\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    596               inputs=args,
    597               attrs=attrs,
--> 598               ctx=ctx)
    599         else:
    600           outputs = execute.execute_with_cancellation(

~\anaconda3\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

InvalidArgumentError:  indices[40,19] = 11263 is not in [0, 10000)
     [[node embedding_1/embedding_lookup (defined at C:\Users\Jorda\anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_1707]

Function call stack:
keras_scratch_graph

As requested in the comments, the method of tokenisation is as follows:

from keras.preprocessing import text, sequence

# create a tokenizer 
token = text.Tokenizer()
token.fit_on_texts(tweets)
word_index = token.word_index

max_tweet_len = 30

# convert text to sequence of tokens and pad them to ensure equal length vectors 

train_seq_x = sequence.pad_sequences(token.texts_to_sequences(x_train), maxlen=max_tweet_len)
test_seq_x = sequence.pad_sequences(token.texts_to_sequences(x_test), maxlen=max_tweet_len)

print(train_seq_x)
print(test_seq_x)
  • When you tokenize your text you must suppress (substitute with 0 or the padding token) all the words that are not in the vocabulary (all the words with an index >= 10000) – Marco Cerliani May 15 '20 at 22:18
  • Thanks for getting back to me! Is this not done by default when I instantiate the matrix with `emb_matrix = numpy.zeros((num_words, glove_dims))`? i.e. breaking if past the 10k will leave the vector as zero? – Jordan MacLachlan May 15 '20 at 22:22
  • no, emb matrix is simply the weight matrix... report the piece of code where you build train_seq_x – Marco Cerliani May 15 '20 at 22:24
  • Gotcha - apologies for the misunderstanding. I've just updated the post to include this block. – Jordan MacLachlan May 15 '20 at 22:29
  • ok, are you sure that the word indices of your tokenizer coincide with the word index in glove? – Marco Cerliani May 15 '20 at 22:42
  • To confirm the mapping is working correctly, I added two print calls to the emb_matrix construction loop: one to print the word `w`, another to print `vect`. The `vect` value for a given word matches the value in the original glove.6B.50d.txt file. Is this what you are referring to? – Jordan MacLachlan May 15 '20 at 22:50
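A quick way to confirm what the comments point to is to check the padded sequences for indices the Embedding layer cannot look up (a minimal sketch, assuming train_seq_x and num_words are defined as in the question):

# train_seq_x comes from pad_sequences, so it is a numpy array of token indices;
# any index >= num_words will make Embedding(num_words, ...) raise InvalidArgumentError
print("largest token index in train_seq_x:", train_seq_x.max())
print("indices outside [0, %d): %d" % (num_words, int((train_seq_x >= num_words).sum())))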

1 Answer


Try fitting your tokenizer this way:

num_words = 10000
token = text.Tokenizer(num_words=num_words)

and, when you define your model, replace num_words with len(word_index) + 1 in the Embedding layer.
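
Putting that together, a minimal sketch of the fix (assuming tweets, x_train, max_tweet_len and glove_dims from the question) would be:

from keras.preprocessing import text, sequence

num_words = 10000

# limiting the tokenizer means texts_to_sequences only emits indices < num_words,
# so every index fits inside Embedding(num_words, ...)
token = text.Tokenizer(num_words=num_words)
token.fit_on_texts(tweets)
word_index = token.word_index

train_seq_x = sequence.pad_sequences(token.texts_to_sequences(x_train), maxlen=max_tweet_len)

# alternatively, keep the unrestricted tokenizer and size the Embedding layer to the
# full vocabulary (emb_matrix must then have len(word_index) + 1 rows as well):
# glove_model.add(Embedding(len(word_index) + 1, glove_dims, input_length=max_tweet_len))

Either change keeps every index produced by texts_to_sequences inside the range the embedding lookup accepts.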

Marco Cerliani