
I am trying to implement a char2vec model that maps person names to 50-dimensional (or any N-dimensional) vectors, similar to FastText's get_word_vector or scikit-learn's TfidfVectorizer.

Basically, I found a supervised LSTM model in ethnicolr's notebook, and I am trying to convert it into an unsupervised autoencoder model.

Here are the details of the model. The input is a post-padded sequence of character bigrams from each person's name.

Input:

person_name = ['Heynis', 'Noordewier-Reddingius', 'De Quant', 'Ahanfouf', 'Falaturi', ...]

### Convert person name to sequence with post padding
X_train = array([[101,  25, 180,  95, 443,   9, 343, 198,  38,  84,  37,   0,   0,   0,   0,   0,   0],
       [128,  27,   8,   6,  22,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [142, 350, 373,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [999,  14,  33,  16, 512,  36,  52, 352,  14,  33,   5, 211, 143,   0,   0,   0,   0],
       [146,  54,  99,  72, 102,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       ...]
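
The preprocessing step above can be sketched in plain Python. This is only an illustration, not ethnicolr's actual code: the vocabulary is built on the fly, so the ids will not match the ones shown in X_train, and id 0 is reserved for padding.

```python
def names_to_sequences(names, feature_len):
    """Split each name into character bigrams, map them to integer ids,
    and post-pad each sequence with zeros to a fixed length."""
    vocab = {}          # bigram -> id; 0 is reserved for padding
    sequences = []
    for name in names:
        bigrams = [name[i:i + 2] for i in range(len(name) - 1)]
        seq = [vocab.setdefault(bg, len(vocab) + 1) for bg in bigrams]
        seq = seq[:feature_len]                    # truncate long names
        seq += [0] * (feature_len - len(seq))      # post-pad with zeros
        sequences.append(seq)
    return sequences, vocab

seqs, vocab = names_to_sequences(['Heynis', 'De Quant'], feature_len=17)
```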

Model:

model = Sequential()
model.add(Embedding(num_words, 32, input_length=feature_len))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))

Ideally this is what I am looking for:

inputs = Input(shape=(feature_len,  ))
embedded = Embedding(num_words, 32)(inputs)
encoded = LSTM(50, dropout=0.2, recurrent_dropout=0.2)(embedded)

decoded = LSTM()(encoded)
decoded_inverse_embedded = Inverse_Embedding()(decoded)   # I know it's silly.
outputs = Layer_something()(decoded_inverse_embedded)   # to convert back to the original shape

autoencoder_model= Model(inputs, outputs)
encoder = Model(inputs, encoded)   # This is what I want, ultimately.

autoencoder_model.fit(X_train, X_train) 

Here is what I tried. I took the code from https://stackoverflow.com/a/59576475/3015105. The training data is reshaped before being fed to the model, so there is no Embedding layer; RepeatVector and TimeDistributed layers are used to reshape the output. The model looks right to me, but I am not sure whether this reshape plus the TimeDistributed layer is equivalent to an Embedding layer.

sequence = X_train.reshape((len(X_train), feature_len, 1))

#define encoder
visible = Input(shape=(feature_len, 1))
encoder = LSTM(50, activation='relu')(visible)

# define reconstruct decoder
decoder1 = RepeatVector(feature_len)(encoder)
decoder1 = LSTM(50, activation='relu', return_sequences=True)(decoder1)
decoder1 = TimeDistributed(Dense(1))(decoder1)

myModel = Model(inputs=visible, outputs=decoder1)

myModel.fit(sequence, sequence, epochs=400)

The result does not seem correct. Are there other approaches to this problem? I have tried both FastText (via gensim) and a TF-IDF model, and I am curious whether this model would perform better.
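
One way to keep the Embedding layer is to have the decoder predict the bigram ids themselves, instead of an "inverse embedding". The following is a hypothetical sketch of that idea, not a verified solution: the TimeDistributed softmax outputs a distribution over the vocabulary at each timestep, so the model can be trained with sparse_categorical_crossentropy against X_train directly. num_words and feature_len are assumed to match the setup above.

```python
from tensorflow.keras.layers import (Input, Embedding, LSTM, RepeatVector,
                                     TimeDistributed, Dense)
from tensorflow.keras.models import Model

num_words = 1000   # size of the bigram vocabulary (assumption)
feature_len = 17   # padded sequence length, as in X_train above

inputs = Input(shape=(feature_len,))
embedded = Embedded = Embedding(num_words, 32, mask_zero=True)(inputs)
encoded = LSTM(50)(embedded)                      # the 50-d name vector

decoded = RepeatVector(feature_len)(encoded)
decoded = LSTM(50, return_sequences=True)(decoded)
# Instead of inverting the embedding, predict each bigram id back:
outputs = TimeDistributed(Dense(num_words, activation='softmax'))(decoded)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, encoded)                  # what you ultimately want

autoencoder.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# autoencoder.fit(X_train, X_train, epochs=...)
# name_vectors = encoder.predict(X_train)
```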

  • With only one class, i.e. names, it won't do well. Try adding more classes, like locations and non-human names, then pass the classes as targets to the output layer. Perhaps that could lead to better results. – Bharath M Shetty Mar 16 '20 at 05:42
  • @Bharath But I am only interested in the vector representation of the person name. – devon Mar 16 '20 at 13:48
