Preprocessing text data for keras LSTM

Question

Referring to the example given in the keras docs here: https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py

I would like to use my own dataset instead of IMDB. After inspecting the format of the default dataset, I see that each word in a sentence is replaced by its vocabulary index, where the vocabulary is sorted by descending word frequency.

I looked through the keras docs at https://keras.io/preprocessing/text/ for a method that would accomplish this, but none of them seem to work for me.

I have been trying the Tokenizer.fit_on_texts and Tokenizer.fit_on_sequences methods. fit_on_texts raises the following error:

AttributeError: 'float' object has no attribute 'lower'

My input is a pandas series of text.

Could anyone point out what I'm doing wrong? I have looked at the following thread, but it did not help:

Keras - Text Classification - LSTM - How to input text?

Thank you!

Wboy · Answer 1 · 2017-07-17T15:42:33.007

2

Found the error: one of the texts was NaN, which is a float rather than a string, so Tokenizer breaks when it tries to lowercase it. Leaving this here in case it helps anyone :)
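For anyone hitting the same thing, here is a minimal sketch that reproduces the failure without keras or pandas and shows one way to filter the input first. The list `texts` and the helper `lowercase_all` are hypothetical stand-ins for the pandas series and for the lowercasing step inside the tokenizer; they are not part of the keras API.

```python
# Hypothetical stand-in for the pandas series: one entry is NaN, which is a float.
texts = ["The movie was great", float("nan"), "Terrible plot"]


def lowercase_all(items):
    # Hypothetical helper that mimics the lowercasing step which fails
    # when an item is not a string.
    return [t.lower() for t in items]


try:
    lowercase_all(texts)
    error_message = None
except AttributeError as exc:
    # Same kind of error as in the question.
    error_message = str(exc)

# Keep only real strings; this drops NaN (and None) entries.
clean_texts = [t for t in texts if isinstance(t, str)]
lowered = lowercase_all(clean_texts)

print(error_message)
print(lowered)
```

With a real pandas series, something like series.dropna() (or series.fillna("") if the row positions have to stay aligned with the labels) before calling Tokenizer.fit_on_texts should have the same effect. Which of the two is appropriate depends on the dataset.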

edited Jul 17 '17 at 15:42

answered Jul 17 '17 at 07:24

Wboy
