
I have the sentence below:

I want to ____ the car because it is cheap.

I want to predict the missing word using an NLP model. Which NLP model should I use? Thanks.

alyssaeliyah

2 Answers


TL;DR

Try this out: https://github.com/huggingface/pytorch-pretrained-BERT

First, you have to set it up properly with:

pip install -U pytorch-pretrained-bert

Then you can use the "masked language model" from the BERT algorithm, e.g.

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = '[CLS] I want to [MASK] the car because it is cheap . [SEP]'
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Create the segments tensors.
segments_ids = [0] * len(tokenized_text)

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

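# Find the position of the [MASK] token in the tokenized text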
masked_index = tokenized_text.index("[MASK]")
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

print(predicted_token)

[out]:

buy

In Long

To truly understand why you need the [CLS], [MASK] and segment tensors, please do read the paper carefully: https://arxiv.org/abs/1810.04805

And if you're lazy, you can read this nice blog post by Lilian Weng: https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html

Other than BERT, there are many other models that can perform the task of filling in the blank. Do look at the other models in the pytorch-pretrained-BERT repository, but more importantly, dive deeper into the task of "language modeling", i.e. the task of predicting the next word given a history.
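For a feel of that next-word flavour of language modeling, here is a minimal sketch using GPT-2 via the newer `transformers` library (the choice of model and library here is just an illustration, not something prescribed by the answer above):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load a pre-trained left-to-right (causal) language model and its tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

# The "history" is everything before the word we want to predict
input_ids = torch.tensor([tokenizer.encode('I want to')])

with torch.no_grad():
    logits = model(input_ids)[0]  # shape: [batch, seq_len, vocab_size]

# The logits at the last position score every candidate next token
next_token_id = torch.argmax(logits[0, -1]).item()
print(tokenizer.decode([next_token_id]))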

alvas
  • Straightforward answer, but you forgot to create the `masked_index`, leading to an error near the end. I am assuming the `masked_index` is just the index of the `[MASK]`? – pilu Mar 15 '19 at 10:15
  • Masking in BERT is not padding (that usually happens at the start/end of the sequence) =) Read the BERT paper for more information. – alvas Mar 15 '19 at 16:13
  • Please add the following code: `masked_index = tokenized_text.index('[MASK]')` – Biranchi Aug 14 '19 at 09:13
  • Another correction: since you have a 3D tensor, you should actually do `predicted_index = torch.argmax(predictions[0][0][masked_index]).item()`. Also, consider @Biranchi's comment. All working this way! – Tiago Duque Nov 19 '19 at 14:59
  • The new version of the Hugging Face library is called `transformers`, so instead of installing `pytorch-pretrained-bert` one could do `pip install transformers` to get the latest version. The second import line becomes `from transformers import ...`. – kitsiosk Aug 07 '20 at 13:59
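Following up on that last comment, here is a short sketch of the same fill-in-the-blank prediction with the renamed `transformers` library and its fill-mask pipeline (assuming a reasonably recent version of the library):

from transformers import pipeline

# The fill-mask pipeline bundles the tokenizer, the masked LM and the decoding step
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# BERT's mask token is [MASK]; the pipeline returns the top-scoring candidates
for candidate in fill_mask('I want to [MASK] the car because it is cheap.'):
    print(candidate['token_str'], candidate['score'])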

There are numerous models you could use. But I think the models most commonly used nowadays for such sequence learning problems are bidirectional RNNs (like a bidirectional LSTM); you can get a hint from here.

But be advised, bidirectional RNNs are very expensive to train. Depending on the problem you want to solve, I highly advise using some pre-trained model. Good luck!
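For a concrete picture of what such a model might look like, here is a minimal PyTorch sketch of a bidirectional LSTM that scores vocabulary words for every position (the layer sizes and vocabulary here are hypothetical, and a usable model would of course have to be trained on a large corpus first):

import torch
import torch.nn as nn

class BiLSTMWordPredictor(nn.Module):
    """Toy bidirectional LSTM that produces a vocabulary score for every position."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True lets each position see both its left and right context
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)   # [batch, seq_len, embed_dim]
        hidden, _ = self.lstm(embedded)    # [batch, seq_len, 2 * hidden_dim]
        return self.out(hidden)            # [batch, seq_len, vocab_size]

# Toy usage: score candidates for the blank at position 3 of a 10-token sentence
model = BiLSTMWordPredictor(vocab_size=10000)
token_ids = torch.randint(0, 10000, (1, 10))
predicted_id = torch.argmax(model(token_ids)[0, 3]).item()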

alift