
OpenAI's text models have a maximum context length; for example, Curie has a context length of 2049 tokens. The API provides max_tokens and stop parameters to control the length of the generated sequence, so generation stops either when a stop token is produced or when max_tokens is reached.

The issue is that when generating text, I don't know how many tokens my prompt contains, so I cannot set max_tokens = 2049 - number_tokens_in_prompt.

This prevents me from generating text dynamically for prompts of widely varying lengths. What I need is to keep generating until the stop token is reached.
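In other words, what I want is something like the following sketch (assuming a tokenizer such as tiktoken, which the answers below cover):

import tiktoken

CONTEXT_LENGTH = 2049  # Curie's context length

def remaining_tokens(prompt: str) -> int:
    # Tokens left for the completion once the prompt is accounted for.
    encoding = tiktoken.get_encoding("r50k_base")  # encoding used by GPT-3 models like Curie
    return CONTEXT_LENGTH - len(encoding.encode(prompt))

print(remaining_tokens("Say this is a test"))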

My questions are:

  • How can I count the number of tokens in my prompt via the Python API, so that I can set the max_tokens parameter accordingly?
  • Is there a way to set max_tokens to the max cap so that I won't need to count the number of prompt tokens?
meliksahturker

3 Answers


As stated in the official OpenAI article:

To further explore tokenization, you can use our interactive Tokenizer tool, which allows you to calculate the number of tokens and see how text is broken into tokens. Alternatively, if you'd like to tokenize text programmatically, use Tiktoken as a fast BPE tokenizer specifically used for OpenAI models. Other such libraries you can explore as well include transformers package for Python or the gpt-3-encoder package for NodeJS.
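For instance, a minimal sketch with the transformers package mentioned above, assuming GPT2TokenizerFast (GPT-2's vocabulary corresponds to the r50k_base encoding used by older GPT-3 models):

from transformers import GPT2TokenizerFast

# GPT-2's BPE vocabulary corresponds to tiktoken's r50k_base encoding.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(len(tokenizer("Hello world, let's test tokenization.")["input_ids"]))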

A tokenizer can split the text string into a list of tokens, as stated in the official OpenAI example on counting tokens with Tiktoken:

Tiktoken is a fast open-source tokenizer by OpenAI.

Given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into a list of tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).
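A quick sketch of that round trip with tiktoken:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("tiktoken is great!")
print(tokens)  # token IDs, e.g. [83, 1609, 5963, 374, 2294, 0]
print([encoding.decode_single_token_bytes(t) for t in tokens])
# [b't', b'ik', b'token', b' is', b' great', b'!']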

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:

  • whether the string is too long for a text model to process and
  • how much an OpenAI API call costs (as usage is priced by token); see the cost sketch below.
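As a sketch of that cost calculation (the price used below is a placeholder, not a real rate; check OpenAI's pricing page for current prices):

import tiktoken

def estimated_prompt_cost(text: str, price_per_1k_tokens: float) -> float:
    # Estimate the prompt cost of an API call from its token count.
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text)) / 1000 * price_per_1k_tokens

# 0.002 is a hypothetical price per 1K tokens.
print(estimated_prompt_cost("Hello world, let's test tiktoken.", 0.002))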

Tiktoken supports 3 encodings used by OpenAI models (source):

Encoding name   OpenAI models
cl100k_base     gpt-4, gpt-3.5-turbo, text-embedding-ada-002
p50k_base       text-davinci-003, text-davinci-002
r50k_base       GPT-3 models (text-curie-001, text-babbage-001, text-ada-001, davinci, curie, babbage, ada)

For cl100k_base and p50k_base encodings, tiktoken is the go-to Python option.

For r50k_base encodings, tokenizers are available in many languages, e.g. tiktoken or the transformers package for Python, and the gpt-3-encoder package for NodeJS.

Note that gpt-3.5-turbo and gpt-4 use tokens in the same way as other models, as stated in the official OpenAI documentation:

Chat models like gpt-3.5-turbo and gpt-4 use tokens in the same way as other models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.

If a conversation has too many tokens to fit within a model’s maximum limit (e.g., more than 4096 tokens for gpt-3.5-turbo), you will have to truncate, omit, or otherwise shrink your text until it fits. Beware that if a message is removed from the messages input, the model will lose all knowledge of it.

Note too that very long conversations are more likely to receive incomplete replies. For example, a gpt-3.5-turbo conversation that is 4090 tokens long will have its reply cut off after just 6 tokens.
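A rough sketch of counting chat tokens, adapted from the approach in OpenAI's cookbook; the per-message overhead constants are approximations for gpt-3.5-turbo-style models and can differ between model versions:

import tiktoken

def num_tokens_from_messages(messages: list, model: str = "gpt-3.5-turbo") -> int:
    # Approximate the token count of a chat request: each message carries a few
    # formatting tokens on top of its content, and the reply is primed as well.
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # approximate per-message overhead
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for value in message.values():
            num_tokens += len(encoding.encode(value))
    num_tokens += 3  # approximate overhead priming the assistant's reply
    return num_tokens

print(num_tokens_from_messages([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))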

How do I use tiktoken?

  1. Install or upgrade tiktoken: pip install --upgrade tiktoken

  2. You have two options.

OPTION 1: Search in the table above for the correct encoding for a given OpenAI model

get_tokens_1.py

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("Hello world, let's test tiktoken.", "cl100k_base"))

If you run get_tokens_1.py, you'll get the following output:

9

OPTION 2: Use tiktoken.encoding_for_model() to automatically load the correct encoding for a given OpenAI model

get_tokens_2.py

import tiktoken

def num_tokens_from_string(string: str, model_name: str) -> int:
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("Hello world, let's test tiktoken.", "gpt-3.5-turbo"))

If you run get_tokens_2.py, you'll get the following output:

9

Note: If you take a careful look at the usage field in the OpenAI API response, you'll see that it reports 10 tokens used for an identical message, i.e., 1 token more than Tiktoken reports. I still haven't figured out why. I tested this in the past (see my past answer). As @Jota mentioned in the comment below, there still seems to be a mismatch between the token usage reported by the OpenAI API response and Tiktoken.

Rok Benko

With the information contained in the comments, I made this: https://gist.github.com/buanzo/7cdd2c34fc0bb25c71b857a16853c6fa

It is a count_tokens implementation that tries tiktoken, then nltk, and finally falls back to .split().

It includes a simple TokenBuffer implementation as well.
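The gist itself is linked above; as a minimal sketch of the same fallback idea (not the gist's exact code), token_counter.py could look like this:

# token_counter.py: try tiktoken, then nltk, then fall back to str.split().
def count_tokens(text: str, debug: bool = False) -> dict:
    try:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        n_tokens, method = len(encoding.encode(text)), "tiktoken"
    except ImportError:
        try:
            import nltk
            # nltk.word_tokenize needs the "punkt" tokenizer data downloaded.
            n_tokens, method = len(nltk.word_tokenize(text)), "nltk"
        except ImportError:
            n_tokens, method = len(text.split()), "split"
    if debug:
        print(f"Counted {n_tokens} tokens using {method}.")
    return {"n_tokens": n_tokens, "method": method}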

We can import the count_tokens function from the token_counter module and call it with our text string as follows:

from token_counter import count_tokens
text = "The quick brown fox jumps over the lazy dog."
result = count_tokens(text, debug=True)
print(result)

If all the required libraries are available, the result is more accurate, but even without tiktoken or nltk the function still returns a dictionary with the number of tokens and the method used to count them. For example:

{'n_tokens': 9, 'method': 'tiktoken'}

Arturo

Here is how I do it with Python 3. You can pass either a model name or an encoding string, and get back the encoding, the tokens, or the token count.

token_helper.py:

import tiktoken

def encoding_getter(encoding_type: str):
    """
    Returns the appropriate encoding based on the given encoding type (either an encoding string or a model name).
    """
    if "k_base" in encoding_type:
        return tiktoken.get_encoding(encoding_type)
    else:
        return tiktoken.encoding_for_model(encoding_type)

def tokenizer(string: str, encoding_type: str) -> list:
    """
    Returns the tokens in a text string using the specified encoding.
    """
    encoding = encoding_getter(encoding_type)
    tokens = encoding.encode(string)
    return tokens

def token_counter(string: str, encoding_type: str) -> int:
    """
    Returns the number of tokens in a text string using the specified encoding.
    """
    num_tokens = len(tokenizer(string, encoding_type))
    return num_tokens

It works like this:

>>> import token_helper
>>> token_helper.token_counter("This string will be counted as tokens", "gpt-3.5-turbo")
7
Timothy Alexis Vass