
I have the following code:

import transformers
from transformers import pipeline

# Load the language model pipeline
model = pipeline("text-generation", model="gpt2")

# Input sentence for generating next word predictions
input_sentence = "I enjoy walking in the"

I want to generate only the next word given the input sentence, but I want to see a list of all possible next words along with their probabilities. Any other LLM can be used; I put GPT-2 here only as an example.

In the code I want to get the top 500 or top 1000 word suggestions for only the next word, together with the probability of each suggested word. How can I do this?

2 Answers


We have to work at a lower level, as the pipeline function is not appropriate for what you are trying to do.

After you pass your sequence to AutoModelForCausalLM, the last tensor in the output will contain the logits of every token in the vocabulary being the next token. In the code below, I call it next_token_candidates_tensor. After that, you simply need to apply a softmax to turn the logits into probabilities, select the indices of the topk candidates, and decode them back to words.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LMHeadModel:

    def __init__(self, model_name):
        # Initialize the model and the tokenizer.
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    def get_predictions(self, sentence):
        # Encode the sentence using the tokenizer and return the model predictions.
        inputs = self.tokenizer.encode(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(inputs)
            predictions = outputs[0]
        return predictions
    
    def get_next_word_probabilities(self, sentence, top_k=500):

        # Get the model predictions for the sentence.
        predictions = self.get_predictions(sentence)
        
        # Get the logits for all next-token candidates (the last position in the sequence).
        next_token_candidates_tensor = predictions[0, -1, :]

        # Get the top k next token candidates.
        topk_candidates_indexes = torch.topk(
            next_token_candidates_tensor, top_k).indices.tolist()

        # Get the token probabilities for all candidates.
        all_candidates_probabilities = torch.nn.functional.softmax(
            next_token_candidates_tensor, dim=-1)
        
        # Filter the token probabilities for the top k candidates.
        topk_candidates_probabilities = \
            all_candidates_probabilities[topk_candidates_indexes].tolist()

        # Decode the top k candidates back to words.
        topk_candidates_tokens = \
            [self.tokenizer.decode([idx]).strip() for idx in topk_candidates_indexes]

        # Return the top k candidates and their probabilities.
        return list(zip(topk_candidates_tokens, topk_candidates_probabilities))


sentence = "I enjoy walking in the"
model = LMHeadModel("gpt2")
model.get_next_word_probabilities(sentence, top_k=500)

# [('park', 0.15904344618320465),
# ('woods', 0.10028065741062164),
# ('streets', 0.0418376550078392),
# ('dark', 0.03117542900145054),
# ('door', 0.029618268832564354),
# ('street', 0.02388935722410679),
# ('rain', 0.021733922883868217),
# ...
Ruan
  • This is great, thanks a lot. A general question: what is the limit for `top_k`? Does it generate probabilities for every possible word in the English vocabulary, or is there a limit? If there is a limit, does it depend on the type of model we use? – datadigger Jun 06 '23 at 18:07
  • Yes, there is a limit for the `top_k`, and it depends on your model. That's because you're not generating probabilities specifically for English words, but rather for the tokens in the model's vocabulary. Therefore, the limit for `top_k` is equal to the length of the `next_token_candidates_tensor`, because each position in this tensor corresponds to one token in the model's vocabulary. – Ruan Jun 06 '23 at 18:26
  • Thanks again Ruan. Is there a model that can generate probabilities for >500k tokens? For GPT-2 the limit was 50k. Is the limit there because many tokens have zero probability, hence they are not printed and the limit is set? – datadigger Jun 06 '23 at 19:14
  • Most models of this kind will have around 50k tokens in their vocabulary, regardless of their size. For each token you receive from calling `get_next_word_probabilities`, add it to the original sequence, and then call `get_next_word_probabilities` again to obtain the next list of tokens. Remember that tokens do not necessarily correspond to single English words (a word can be represented by multiple tokens). – Ruan Jun 07 '23 at 10:09
  • Let's say we have 500k English words; once they are converted into tokens we will have an even larger number. So my understanding was that LLMs provide probabilities for every single token they have been trained on. Or is that wrong? Why is there a 50k limit? – datadigger Jun 07 '23 at 21:18
  • Maybe this document will make things clearer for you: [Byte-Pair Encoding tokenization](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt) – Ruan Jun 08 '23 at 03:50
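
A minimal sketch (not part of the original answer) illustrating the points from the comments above: the upper bound for top_k is the model's vocabulary size, and longer continuations are produced by appending the chosen token and querying the model again. It reuses the LMHeadModel class from the answer; generate_next_words is just an illustrative name.

# The upper bound for top_k is the vocabulary size, not the number of English words.
model = LMHeadModel("gpt2")
print(len(model.tokenizer))           # 50257 for gpt2
print(model.model.config.vocab_size)  # 50257 for gpt2

def generate_next_words(sentence, steps=3):
    # Repeatedly pick the most probable next token and append it to the sentence.
    # Joining with a space is an approximation, since decode(...).strip() in the
    # answer drops the leading space that GPT-2 tokens usually carry.
    for _ in range(steps):
        word, prob = model.get_next_word_probabilities(sentence, top_k=1)[0]
        sentence = sentence + " " + word
        print(f"{word!r} ({prob:.4f}) -> {sentence}")
    return sentence

generate_next_words("I enjoy walking in the")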

I think you do yourself a favor when you avoid the pipeline for this and just use the respective language modeling class. All you need to do is:

  1. Get the logits of the next token (GPT-2 uses tokens, which are not necessarily whole words).
  2. Apply the softmax to get the probabilities.
  3. Apply topk to retrieve the k most probable tokens.

import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

t = GPT2TokenizerFast.from_pretrained("gpt2")
m = GPT2LMHeadModel.from_pretrained("gpt2")

encoded_text = t("I enjoy walking in the", return_tensors="pt")

# 1. Get the logits of the next token
with torch.inference_mode():
  outputs = m(**encoded_text)

next_token_logits = outputs.logits[0, -1, :]
print(next_token_logits.shape)
print(next_token_logits)

# 2. Convert the logits to probabilities
next_token_probs = torch.softmax(next_token_logits, -1)

# 3. Get the top 10
topk_next_tokens = torch.topk(next_token_probs, 10)

# Putting it together
print(*[(t.decode(idx), prob) for idx, prob in zip(topk_next_tokens.indices, topk_next_tokens.values)], sep="\n")

Output:

torch.Size([50257])
tensor([ -95.1139,  -93.7291,  -97.5711,  ...,  -98.0303, -100.2803,
         -96.1145])
(' park', tensor(0.1590))
(' woods', tensor(0.1003))
(' streets', tensor(0.0418))
(' dark', tensor(0.0312))
(' door', tensor(0.0296))
(' street', tensor(0.0239))
(' rain', tensor(0.0217))
(' city', tensor(0.0189))
(' same', tensor(0.0150))
(' halls', tensor(0.0135))
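
Not part of the original answer, but since the question asks for the top 500 rather than the top 10, the last step can be repeated with k=500 and the tensors converted to plain Python floats (a sketch reusing next_token_probs and t from the code above; top_500 is just an illustrative name):

# Same as step 3, but with k=500 and plain Python values via .item()
topk_next_tokens = torch.topk(next_token_probs, 500)
top_500 = [(t.decode(idx), prob.item())
           for idx, prob in zip(topk_next_tokens.indices, topk_next_tokens.values)]
print(top_500[:3])
# [(' park', 0.159...), (' woods', 0.100...), (' streets', 0.041...)]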
cronoik