
I've read my train, test and validation sentences into train_sentences, test_sentences and val_sentences.

Then I applied a TF-IDF vectorizer to them:

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)

X_train = vectorizer.transform(train_sentences)
X_val = vectorizer.transform(val_sentences)
X_test = vectorizer.transform(test_sentences)
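For what it's worth, transform gives back scipy sparse matrices, so a Keras model will generally want them densified first; a quick shape check (with toy stand-in sentences, since the real data isn't shown here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-ins for train_sentences
train_sentences = ["the cat sat on the mat", "the dog ran", "cats and dogs play"]

vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(train_sentences)

X_train = vectorizer.transform(train_sentences)

# transform returns a scipy sparse matrix; .toarray() densifies it
X_train_dense = X_train.toarray()
print(X_train_dense.shape)  # (3 sentences, vocabulary size up to 300)
```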

And my model looks like this:

model = Sequential()

model.add(Input(????))

model.add(Flatten())

model.add(Dense(256, activation='relu'))

model.add(Dense(32, activation='relu'))

model.add(Dense(8, activation='sigmoid'))

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Normally we pass an embedding matrix to the Embedding layer in the case of word2vec.

How should I use TF-IDF in a Keras model? Please provide an example.

Thanks.

Mogambo
  • Why would you like to use TF/IDF values in the embedding layer? – Mathias Müller Feb 12 '20 at 19:41
  • Actually my plan was to use 2 different types of inputs: 1) TF-IDF (300) and 2) word2vec embeddings (300), concatenate them into one, and pass the result through the dense layers. I couldn't find any examples showing that. – Mogambo Feb 12 '20 at 19:44
  • Can you please clarify whether 1) you want to use TF/IDF values as _input_ for the embedding layer 2) you want to concatenate TF/IDF vectors with embedding vectors (the _output_ of embedding layers). Thanks. – Mathias Müller Feb 12 '20 at 19:52
  • I want to concatenate Tf-IDF vectors with embedding vectors. Sorry for the confusion – Mogambo Feb 12 '20 at 19:54
  • There will be one embedding vector for each word in an input sentence. This shape (sequence_length, embedding_size) is not compatible with _one_ TF/IDF vector for a sentence. How would you combine them? – Mathias Müller Feb 12 '20 at 20:45

1 Answer


I cannot imagine a good reason for combining TF/IDF values with embedding vectors, but here is a possible solution: use the functional API, multiple Input layers, and the concatenate function.

To concatenate layer outputs, their shapes must match (except along the axis that is being concatenated). One method is to average a sentence's embeddings over the sequence axis, then concatenate that mean vector with the sentence's TF/IDF vector.
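The shape arithmetic behind this can be checked with plain numpy (random stand-in values, same dimensions as the model below):

```python
import numpy as np

maxlen, embedding_size, tfidf_size = 50, 300, 300

# one sentence: a sequence of word embeddings vs. a single TF/IDF vector
embeddings = np.random.rand(maxlen, embedding_size)  # (50, 300)
tfidf = np.random.rand(tfidf_size)                   # (300,)

# averaging over the sequence axis collapses (50, 300) -> (300,)
mean_embedding = embeddings.mean(axis=0)

# now both are 1-D vectors of length 300 and can be concatenated
combined = np.concatenate([tfidf, mean_embedding])
print(combined.shape)  # (600,)
```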

Setup and some sample data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.datasets import fetch_20newsgroups

import numpy as np

import keras

from keras.models import Model
from keras.layers import Dense, Activation, concatenate, Embedding, Input

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# some sample training data
bunch = fetch_20newsgroups()
all_sentences = []

for document in bunch.data:
  sentences = document.split("\n")
  all_sentences.extend(sentences)

all_sentences = all_sentences[:1000]

X_train, X_test = train_test_split(all_sentences, test_size=0.1)
len(X_train), len(X_test)

# fit TF-IDF on the training sentences only
vectorizer = TfidfVectorizer(max_features=300)
vectorizer = vectorizer.fit(X_train)

df_train = vectorizer.transform(X_train)

# integer-encode the sentences for the Embedding layer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

maxlen = 50

sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_train = pad_sequences(sequences_train, maxlen=maxlen)

Model definition

vocab_size = len(tokenizer.word_index) + 1
embedding_size = 300

# two inputs: one TF/IDF vector and one padded word-id sequence per sentence
input_tfidf = Input(shape=(300,))
input_text = Input(shape=(maxlen,))

embedding = Embedding(vocab_size, embedding_size, input_length=maxlen)(input_text)

# this averaging method taken from:
# https://stackoverflow.com/a/54217709/1987598

mean_embedding = keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1))(embedding)

concatenated = concatenate([input_tfidf, mean_embedding])

dense1 = Dense(256, activation='relu')(concatenated)
dense2 = Dense(32, activation='relu')(dense1)
dense3 = Dense(8, activation='sigmoid')(dense2)

model = Model(inputs=[input_tfidf, input_text], outputs=dense3)

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
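To train this model, both inputs go in as a list: the sparse TF/IDF matrix must be densified, and the targets need shape (n_samples, 8) to match the sigmoid output. A sketch of the shapes involved, with a random sparse stand-in for df_train and made-up placeholder labels (the question doesn't specify the real targets):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# stand-in for df_train from above; the real one is a scipy sparse matrix too
df_train = sparse_random(900, 300, density=0.01, format='csr')

# Keras fit() expects dense arrays, so densify the TF/IDF matrix
X_tfidf = df_train.toarray()          # (900, 300)

# made-up multi-label targets just to make the shapes concrete;
# replace with the real labels
y_train = np.random.randint(0, 2, size=(900, 8))

print(X_tfidf.shape, y_train.shape)
# then: model.fit([X_tfidf, sequences_train], y_train, epochs=5, batch_size=32)
```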

Model Summary Output

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_11 (InputLayer)           (None, 50)           0                                            
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 50, 300)      633900      input_11[0][0]                   
__________________________________________________________________________________________________
input_10 (InputLayer)           (None, 300)          0                                            
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 300)          0           embedding_5[0][0]                
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 600)          0           input_10[0][0]                   
                                                                 lambda_1[0][0]                   
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 256)          153856      concatenate_4[0][0]              
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 32)           8224        dense_5[0][0]                    
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 8)            264         dense_6[0][0]                    
==================================================================================================
Total params: 796,244
Trainable params: 796,244
Non-trainable params: 0
Mathias Müller