As stated in the official OpenAI article:

> To further explore tokenization, you can use our interactive Tokenizer tool, which allows you to calculate the number of tokens and see how text is broken into tokens. Alternatively, if you'd like to tokenize text programmatically, use tiktoken as a fast BPE tokenizer specifically used for OpenAI models. Other such libraries you can explore as well include the `transformers` package for Python or the `gpt-3-encoder` package for NodeJS.
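As one illustration of the `transformers` alternative mentioned in the quote, here's a minimal sketch using `GPT2TokenizerFast` (assuming the `transformers` package is installed; note it implements the GPT-2 encoding, not the newer `cl100k_base`):

```python
from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer (equivalent to tiktoken's r50k_base/gpt2 encoding)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Count tokens the same way as tiktoken: encode and take the length
print(len(tokenizer.encode("Hello world, let's test tiktoken.")))
```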
A tokenizer can split a text string into a list of tokens, as stated in the official OpenAI example on counting tokens with tiktoken:

> tiktoken is a fast open-source tokenizer by OpenAI.
>
> Given a text string (e.g., `"tiktoken is great!"`) and an encoding (e.g., `"cl100k_base"`), a tokenizer can split the text string into a list of tokens (e.g., `["t", "ik", "token", " is", " great", "!"]`).
>
> Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:
>
> - whether the string is too long for a text model to process and
> - how much an OpenAI API call costs (as usage is priced by token).
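You can reproduce this split yourself by encoding the string and decoding each token individually. A minimal sketch (tiktoken installation is covered below):

```python
import tiktoken

# Load the cl100k_base encoding used in the quoted example
encoding = tiktoken.get_encoding("cl100k_base")

# Encode the example string into token IDs
token_ids = encoding.encode("tiktoken is great!")

# Decode each token ID back to its text to see how the string was split
tokens = [encoding.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
print(tokens)  # ['t', 'ik', 'token', ' is', ' great', '!']
```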
tiktoken supports 3 encodings used by OpenAI models (source):

| Encoding name | OpenAI models |
| --- | --- |
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | `text-davinci-003`, `text-davinci-002` |
| `r50k_base` | GPT-3 models (`text-curie-001`, `text-babbage-001`, `text-ada-001`, `davinci`, `curie`, `babbage`, `ada`) |
For `cl100k_base` and `p50k_base` encodings:

- `tiktoken` (Python)

For `r50k_base` encodings, tokenizers are available in many languages:

- `tiktoken` (Python)
- `GPT2TokenizerFast` from the `transformers` package (Python)
- `gpt-3-encoder` (NodeJS)
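The choice of encoding matters: the same string tokenizes differently, and to different counts, under different encodings. A quick sketch to compare them, assuming tiktoken is installed:

```python
import tiktoken

text = "Hello world, let's test tiktoken."

# Compare token counts across the three encodings from the table above
for name in ["cl100k_base", "p50k_base", "r50k_base"]:
    encoding = tiktoken.get_encoding(name)
    print(name, len(encoding.encode(text)))
```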
Note that `gpt-3.5-turbo` and `gpt-4` use tokens in the same way as other models, as stated in the official OpenAI documentation:

> Chat models like `gpt-3.5-turbo` and `gpt-4` use tokens in the same way as other models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.
>
> If a conversation has too many tokens to fit within a model's maximum limit (e.g., more than 4096 tokens for `gpt-3.5-turbo`), you will have to truncate, omit, or otherwise shrink your text until it fits. Beware that if a message is removed from the messages input, the model will lose all knowledge of it.
>
> Note too that very long conversations are more likely to receive incomplete replies. For example, a `gpt-3.5-turbo` conversation that is 4090 tokens long will have its reply cut off after just 6 tokens.
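Because of that message-based formatting, counting tokens for a conversation means adding a per-message overhead on top of the content tokens. The sketch below follows the approach shown in the OpenAI cookbook; the overhead constants (3 tokens per message, 1 per name, 3 to prime the assistant's reply) are assumptions that match `gpt-3.5-turbo-0613`-era models and may differ for other model versions:

```python
import tiktoken

def num_tokens_from_messages(messages: list[dict], model: str = "gpt-3.5-turbo") -> int:
    """Estimate the token count of a chat conversation.

    The per-message overhead values below are assumptions based on the
    OpenAI cookbook and may vary between model versions.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # overhead for each message's formatting
    tokens_per_name = 1     # extra overhead if a "name" field is present
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with the assistant's start tokens
    return num_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello world, let's test tiktoken."},
]
print(num_tokens_from_messages(messages))
```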
How do I use tiktoken?
Install or upgrade tiktoken: `pip install --upgrade tiktoken`
You have two options.
OPTION 1: Search in the table above for the correct encoding for a given OpenAI model
If you run `get_tokens_1.py`, you'll get the following output:

```
9
```

get_tokens_1.py

```python
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in a string for a given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("Hello world, let's test tiktoken.", "cl100k_base"))
```
OPTION 2: Use `tiktoken.encoding_for_model()` to automatically load the correct encoding for a given OpenAI model

If you run `get_tokens_2.py`, you'll get the following output:

```
9
```

get_tokens_2.py

```python
import tiktoken

def num_tokens_from_string(string: str, model_name: str) -> int:
    """Return the number of tokens in a string for a given OpenAI model."""
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("Hello world, let's test tiktoken.", "gpt-3.5-turbo"))
```
Note: If you take a careful look at the `usage` field in the OpenAI API response, you'll see that it reports `10` tokens used for an identical message, i.e., 1 token more than tiktoken. I still haven't figured out why. I tested this in the past (see my past answer). As @Jota mentioned in the comment below, there still seems to be a mismatch between the token usage reported by the OpenAI API response and tiktoken.