I have a homework assignment that requires me to build an algorithm that can guess a missing word in a sentence. For example, given the input sentence "I took my **** for a walk this morning", the output should guess the missing word (dog). The assignment requires me to train my own model from scratch, so I built a corpus of about 500,000 sentences. I cleaned the corpus: it is all lower-case and every sentence is separated by a newline (\n) character. I also have a vocabulary.txt file which lists all the unique words in descending order of frequency, starting with the three special tokens <S>, </S> and <UNK>. Finally, I have a small set of test sentences, one per line, each with one missing word denoted by [MASK].
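For reference, here is roughly how I generated vocabulary.txt from the corpus. This is a minimal sketch and the file names are my own; the format (the three special tokens first, then one word per line in descending frequency) is the one described above:

import collections

# Count word frequencies over the cleaned, lower-cased corpus
# (one sentence per line).
counts = collections.Counter()
with open('data.txt', encoding='utf-8') as f:
    for line in f:
        counts.update(line.split())

# bilm-tf expects the three special tokens first, then one word
# per line, sorted by descending count.
with open('vocabulary.txt', 'w', encoding='utf-8') as f:
    f.write('<S>\n</S>\n<UNK>\n')
    for word, _ in counts.most_common():
        f.write(word + '\n')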

I followed the instructions at https://github.com/allenai/bilm-tf, which provides the steps to train your own ELMo model from scratch.

After gathering data.txt and the vocabulary file, I ran

python bin/train_elmo.py --train_prefix=<path to training folder> --vocab_file <path to vocab file> --save_dir <path where models will be checkpointed>

and trained the model on my corpus with TensorFlow on a CUDA-enabled GPU.
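One thing I had to handle first: as far as I can tell, bin/train_elmo.py hard-codes some options, including n_train_tokens, for the original 1 Billion Word benchmark, so I counted the tokens in my own corpus and updated that value. A quick sketch, with my own file name:

# Count the tokens in the training corpus so that n_train_tokens in
# bin/train_elmo.py can be set to the actual corpus size.
n_tokens = 0
with open('data.txt', encoding='utf-8') as f:
    for line in f:
        n_tokens += len(line.split())
print(n_tokens)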

After the training finished, I used the following command:

python bin/dump_weights.py --save_dir /output_path/to/checkpoint --outfile /output_path/to/weights.hdf5

This gave me the weights.hdf5 and options.json files.
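To sanity-check the dump, I opened the file with h5py and listed the groups stored in it (a quick sketch; nothing here is specific to my model except the file name):

import h5py

# Print every group and dataset stored in the dumped weight file.
with h5py.File('weights.hdf5', 'r') as f:
    f.visit(print)

The only warning I received while training the model is: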

WARNING: Error encountered when serializing lstm_output_embeddings. Type is unsupported, or the types of the items don't match field type in CollectionDef. 'list' object has no attribute 'name'

which is mentioned in the AllenAI repo as harmless, so it is safe to assume that the training phase finished correctly. My problem is that I have no idea what to do after this point. The answer to the Stack Overflow question Predicting Missing Words in a sentence - Natural Language Processing Model states that the following code can be used to predict the missing word:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = '[CLS] I want to [MASK] the car because it is cheap . [SEP]'
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Position of the [MASK] token, needed to read out the prediction later
masked_index = tokenized_text.index('[MASK]')

# Create the segments tensors.
segments_ids = [0] * len(tokenized_text)

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

print(predicted_token)

Unfortunately, the code above is for BERT models, and my assignment requires me to use an ELMo model. I tried to find a library similar to pytorch_pretrained_bert for ELMo, but I couldn't find anything. What can I do to predict the masked words using my ELMo model?
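To make the question more concrete, here is the direction I have been experimenting with so far. Since ELMo is a bidirectional language model with no [MASK] token, my current idea is to fill the blank with each candidate word and rank the candidates by how well they fit the context. The sketch below loads my own options.json/weights.hdf5 into allennlp's ElmoEmbedder; the candidate list, the file paths, the use of <UNK> as a placeholder and the cosine-similarity score are all my own assumptions, not something the bilm-tf repo prescribes:

import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

# Load the biLM trained with bilm-tf.
elmo = ElmoEmbedder(options_file='options.json', weight_file='weights.hdf5')

def slot_vector(tokens, i):
    # embed_sentence returns an array of shape (3, len(tokens), dim):
    # layer 0 is the context-independent char-CNN, layers 1 and 2 are
    # the LSTM layers. Take the top LSTM layer at position i.
    return elmo.embed_sentence(tokens)[2][i]

sentence = 'i took my [MASK] for a walk this morning'.split()
masked_index = sentence.index('[MASK]')

# Contextual vector of the slot when it holds a neutral placeholder.
context_only = slot_vector(
    sentence[:masked_index] + ['<UNK>'] + sentence[masked_index + 1:],
    masked_index)

candidates = ['dog', 'car', 'book']  # in practice: words from vocabulary.txt
scores = {}
for word in candidates:
    filled = sentence[:masked_index] + [word] + sentence[masked_index + 1:]
    vec = slot_vector(filled, masked_index)
    # Crude score: cosine similarity between the candidate-in-context
    # vector and the context-only vector.
    scores[word] = np.dot(vec, context_only) / (
        np.linalg.norm(vec) * np.linalg.norm(context_only))

print(max(scores, key=scores.get))

This is only a similarity heuristic, not a real language-model prediction: as far as I can tell, dump_weights.py does not export the softmax layer, so I cannot read off a probability distribution over the vocabulary the way the BERT snippet does. Is there a proper way to get per-word probabilities at the masked position out of the bilm-tf checkpoint, or is candidate scoring like this the best I can do?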

Thanks.

  • Although they're both from Sesame Street, ELMo works differently from BERT. Understand the model, why `[CLS]` and `[MASK]` exist in BERT, and how ELMo is built. http://jalammar.github.io/illustrated-bert/ – alvas May 21 '19 at 22:20
  • I am aware that they are different. I mentioned the BERT code just to better explain what I was trying to do with my ELMo model. Since ELMo also takes context into account, it should be well suited to my assignment. – ihatepointers May 21 '19 at 23:10

0 Answers