I am trying to implement a char2vec model that maps person names into 50-dimensional (or any N-dimensional) vectors, very much like FastText's get_word_vector or scikit-learn's TfidfVectorizer.
Basically, I found the supervised LSTM model in ethnicolr's notebook and I am trying to convert it into an unsupervised autoencoder model.
Here are the details of the model. The input is a post-padded sequence of the character bi-grams of each person name.
Input:
person_name = ['Heynis', 'Noordewier-Reddingius', 'De Quant', 'Ahanfouf', 'Falaturi', ...]
### Convert person names to sequences with post padding
X_train = array([[101, 25, 180, 95, 443, 9, 343, 198, 38, 84, 37, 0, 0, 0, 0, 0, 0],
[128, 27, 8, 6, 22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[142, 350, 373, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[999, 14, 33, 16, 512, 36, 52, 352, 14, 33, 5, 211, 143, 0, 0, 0, 0],
[146, 54, 99, 72, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
...]
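In case it matters, this is roughly how I produce those sequences. This is only a minimal sketch; the overlapping bi-gram splitting and the Tokenizer settings are my assumptions based on ethnicolr's preprocessing, not the exact notebook code:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

person_name = ['Heynis', 'Noordewier-Reddingius', 'De Quant', 'Ahanfouf', 'Falaturi']

# split each name into overlapping character bi-grams, e.g. 'Heynis' -> ['He', 'ey', 'yn', 'ni', 'is']
bigrams = [[name[i:i + 2] for i in range(len(name) - 1)] for name in person_name]

tokenizer = Tokenizer(filters='', lower=False)   # keep the bi-grams exactly as they are
tokenizer.fit_on_texts(bigrams)
sequences = tokenizer.texts_to_sequences(bigrams)

num_words = len(tokenizer.word_index) + 1        # +1 for the padding index 0
feature_len = max(len(s) for s in sequences)     # length of the longest bi-gram sequence
X_train = pad_sequences(sequences, maxlen=feature_len, padding='post')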
Model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(num_words, 32, input_length=feature_len))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))
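For context, in the notebook this model is compiled and trained against the ethnicity labels, roughly like this (my paraphrase, not the exact notebook code):

# y_train: one-hot encoded class labels (num_classes columns)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=32, epochs=15, validation_split=0.1)

Since I don't want class labels at all, this is the part I am trying to replace with an autoencoder objective.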
Ideally this is what I am looking for:
inputs = Input(shape=(feature_len, ))
embedded = Embedding(num_words, 32)(inputs)
encoded = LSTM(50, dropout=0.2, recurrent_dropout=0.2)(embedded)
decoded = LSTM()(encoded)
decoded_inverse_embedded = Inverse_Embedding()(decoded)  # I know it's silly.
outputs = Layer_something()  # to convert back to the original shape
autoencoder_model = Model(inputs, outputs)
encoder = Model(inputs, encoded) # This is what I want, ultimately.
autoencoder_model.fit(X_train, X_train)
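Since Inverse_Embedding obviously doesn't exist, my understanding is that the usual workaround is to have the decoder predict the bi-gram id at each timestep with a softmax over the vocabulary and train with sparse categorical crossentropy. Something like this rough sketch (the layer sizes and training settings are just my guesses):

from tensorflow.keras.layers import Input, Embedding, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(feature_len,))
embedded = Embedding(num_words, 32, mask_zero=True)(inputs)        # mask the post padding
encoded = LSTM(50, dropout=0.2, recurrent_dropout=0.2)(embedded)   # the 50-dim name vector

# decoder: repeat the 50-dim code for every timestep and predict the bi-gram id
# at each position with a softmax over the vocabulary (instead of an "inverse embedding")
repeated = RepeatVector(feature_len)(encoded)
decoded = LSTM(50, return_sequences=True)(repeated)
outputs = TimeDistributed(Dense(num_words, activation='softmax'))(decoded)

autoencoder_model = Model(inputs, outputs)
encoder = Model(inputs, encoded)   # this is the part I ultimately want

# the targets are the input sequences themselves (integer ids), hence the sparse loss
autoencoder_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
autoencoder_model.fit(X_train, X_train, epochs=50, batch_size=32)

name_vectors = encoder.predict(X_train)   # shape: (len(X_train), 50)

Is this the right way to replace the missing "inverse embedding" step, or is there a better construction?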
Here is what I tried. I took the code from https://stackoverflow.com/a/59576475/3015105. There, the training data is reshaped before it is fed to the model, so no Embedding layer is needed, and RepeatVector and TimeDistributed layers are used to restore the output shape. The model seems right to me, but I am not sure whether this reshape plus TimeDistributed(Dense(...)) setup is equivalent to using an Embedding layer.
import numpy as np
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

# treat each padded sequence as a univariate time series: (samples, timesteps, 1)
sequence = X_train.reshape((len(X_train), feature_len, 1))

# define encoder
visible = Input(shape=(feature_len, 1))
encoder = LSTM(50, activation='relu')(visible)

# define reconstruction decoder
decoder1 = RepeatVector(feature_len)(encoder)
decoder1 = LSTM(50, activation='relu', return_sequences=True)(decoder1)
decoder1 = TimeDistributed(Dense(1))(decoder1)

myModel = Model(inputs=visible, outputs=decoder1)
myModel.compile(optimizer='adam', loss='mse')
myModel.fit(sequence, sequence, epochs=400)
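For completeness, this is how I pull the 50-dimensional vectors out of that model once it is trained (the variable names here are mine):

# the encoder half alone maps each padded name sequence to a 50-dim vector
encoder_model = Model(inputs=visible, outputs=encoder)
name_vectors = encoder_model.predict(sequence)   # shape: (len(X_train), 50)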
The results do not look right. Are there other approaches to this problem? I have tried both FastText (via gensim) and a TF-IDF model, and I am curious whether this kind of model would work better.