
I am building a 1D CNN model using Keras for text classification where the input is a sequence of words generated by tokenizer.texts_to_sequences. Is there a way to also feed in a sequence of numerical features (e.g. a score) for each word in the sequence? For example, for sentence 1 the input would be ['the', 'dog', 'barked'] and each word in this particular sequence has the scores [0.9, 0.75, 0.6]. The scores are not word-specific but sentence-specific scores for the words (if that makes a difference for how to format the input). Would an LSTM be more appropriate in this case?

Many thanks in advance!

Getch

1 Answer


Yes, just use 2 channels in the input tensor.

In other words, if your input previously had shape: (batch_size, seq_len)

Now you could have: (batch_size, seq_len, 2)

If you look at the Keras documentation, you see that with the parameter data_format you pass a string, one of channels_last (default) or channels_first. In this case the default would be fine, because the 2 (the number of channels) is last.

You can just stack the 2 input arrays into a tensor with this shape.
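For instance, a rough sketch of the stacking (the array names here are just placeholders, and both arrays are assumed to be already padded to the same length):

```python
import numpy as np

# word_ids: output of tokenizer.texts_to_sequences, zero-padded
# word_scores: the per-word scores, padded the same way
word_ids = np.array([[12, 47, 95, 0],
                     [33,  8,  2, 61]], dtype="float32")
word_scores = np.array([[0.9, 0.75, 0.6, 0.0],
                        [0.5, 0.8,  0.4, 0.7]], dtype="float32")

# Stack along a new last axis -> shape (batch_size, seq_len, 2),
# i.e. channels_last, which is the Keras default.
x = np.stack([word_ids, word_scores], axis=-1)
print(x.shape)  # (2, 4, 2)
```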

Now if you use a word embedding, the number of channels will probably not be 2; it will be embedding_dim + 1, so the final input shape would be: (batch_size, seq_len, embedding_dim + 1)
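As a sketch, assuming you already have per-word embedding vectors as a NumPy array (e.g. looked up from pretrained word vectors), you could append the score as one extra channel like this:

```python
import numpy as np

batch_size, seq_len, embedding_dim = 2, 4, 50
word_embeddings = np.random.rand(batch_size, seq_len, embedding_dim).astype("float32")
word_scores = np.random.rand(batch_size, seq_len).astype("float32")

# Append the score as one extra channel next to the embedding dimensions.
x = np.concatenate([word_embeddings, word_scores[..., np.newaxis]], axis=-1)
print(x.shape)  # (2, 4, 51) -> (batch_size, seq_len, embedding_dim + 1)
```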

In general you can also refer to this other Stack Overflow question.

In any case, both a 1D CNN and an LSTM could be good models... but that is something you need to discover yourself, depending on your task, data and model constraints.

Now as a final remark, you could even think of a model with multiple inputs: one for the word sequence and the other for the scores. See this documentation page or this random tutorial I found on the internet. You can again refer to the same SO question.
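A minimal sketch of that multi-input idea with the Keras functional API (the sizes and layer choices below are placeholders, not something specific to your task):

```python
from tensorflow.keras import layers, Model

vocab_size, seq_len, embedding_dim = 10000, 100, 50

# Two inputs: the padded word indices and the per-word scores
word_input = layers.Input(shape=(seq_len,), name="word_ids")
score_input = layers.Input(shape=(seq_len, 1), name="word_scores")

# Embed the words, then append the score as an extra channel
embedded = layers.Embedding(vocab_size, embedding_dim)(word_input)   # (batch, seq_len, embedding_dim)
merged = layers.Concatenate(axis=-1)([embedded, score_input])        # (batch, seq_len, embedding_dim + 1)

x = layers.Conv1D(64, kernel_size=3, activation="relu")(merged)
x = layers.GlobalMaxPooling1D()(x)
output = layers.Dense(1, activation="sigmoid")(x)

model = Model(inputs=[word_input, score_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

You would then call model.fit([word_ids, word_scores[..., np.newaxis]], labels, ...), passing one array per input.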

Luca Angioloni