
How to generate a list of tokens that are most likely to occupy the place of a missing token in a given sentence?

I've found this StackOverflow answer; however, it only generates a single possible word, not a list of words that fit the sentence. I tried printing out every variable to see whether the answer's author had generated all the possible words somewhere, but no luck.

For example,

>>> sentence = 'Cristiano Ronaldo dos Santos Aveiro GOIH ComM is a Portuguese professional [].' # [] is the missing word
>>> generate(sentence)
['soccer', 'basketball', 'tennis', 'rugby']
Sonav
  • For the generated word in the answer referenced in your question, there may be a way to assign the word to a variable and/or append it to a list. Did you try this approach? – etch_45 Nov 19 '20 at 04:21
  • @etch_45 I've tried, but I don't think I did it right. It would be great if you could suggest an approach. – Sonav Nov 19 '20 at 04:46
  • If you could edit the question with the code you've tried, that would be helpful for review and debugging. – etch_45 Nov 19 '20 at 04:50
  • @etch_45 I tried, but it didn't seem to work. – Sonav Nov 19 '20 at 04:59

2 Answers


You can do essentially the same as in this answer, but instead of taking just the best-fitting token, take, for example, the five best-fitting tokens:

# setup as in the linked answer, using the pytorch_pretrained_bert package
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            # instead of argmax, use argsort to rank all tokens by how well they fit
            predicted_index = torch.argsort(predictions[0, i], descending=True)
            tokens = []
            # take the 5 best-fitting tokens and add them to the list
            for k in range(5):
                predicted_token = tokenizer.convert_ids_to_tokens([predicted_index[k].item()])[0]
                tokens.append(predicted_token)
            results.append(tokens)
    return results

For your sentence, this results in: [['footballer', 'golfer', 'football', 'cyclist', 'boxer']]
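Note that fill_the_gaps looks for the literal '[MASK]' token, so the gap has to be written as [MASK] in the input rather than [] (with a space before the final period, so the tokenizer does not merge the mask token with the punctuation). A usage sketch, assuming the setup above:

>>> sentence = 'Cristiano Ronaldo dos Santos Aveiro GOIH ComM is a Portuguese professional [MASK] .'
>>> fill_the_gaps(sentence)
[['footballer', 'golfer', 'football', 'cyclist', 'boxer']]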

chefhose
0

I've just tried out your example on the HuggingFace model hub with the bert-base-uncased model, and it generates a list of possible tokens:

(screenshot: the model hub's fill-mask widget listing the top predicted tokens with their probabilities)

I could write out a Colab notebook to explain how to code this up. A masked language model outputs a probability distribution over the whole vocabulary for each masked position, so you can simply return the tokens with the highest probabilities.
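For completeness, here is a minimal sketch of the same idea in code, assuming a recent version of the transformers library (version 4 or later, where the pipeline argument is named top_k):

from transformers import pipeline

# build a fill-mask pipeline with BERT; the model is downloaded on first use
unmasker = pipeline('fill-mask', model='bert-base-uncased')

sentence = ('Cristiano Ronaldo dos Santos Aveiro GOIH ComM '
            'is a Portuguese professional [MASK].')

# each result holds a candidate token and its probability
for result in unmasker(sentence, top_k=5):
    print(result['token_str'], result['score'])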

Niels