Regarding cosine similarity calculation
See my past answer, especially the following part (i.e., STEP 3):
We need to calculate an embedding vector for the input so that we can compare the input with a given "fact" and see how similar these two texts are. More precisely, we compare the embedding vector of the input with the embedding vector of the "fact". Then we compare the input with the second "fact", the third "fact", the fourth "fact", etc. Run get_cosine_similarity.ipynb.
get_cosine_similarity.ipynb
import os

import numpy as np
import openai
import pandas as pd
from openai.embeddings_utils import cosine_similarity

openai.api_key = os.getenv('OPENAI_API_KEY')

my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT_HERE>'

def get_embedding(model: str, text: str) -> list[float]:
    # Calculate an embedding vector for the given text
    result = openai.Embedding.create(
        model=model,
        input=text
    )
    return result['data'][0]['embedding']

# Calculate the embedding vector for the input
input_embedding_vector = get_embedding(my_model, my_input)

# Load the facts with their precomputed embedding vectors
df = pd.read_csv('companies_embeddings.csv')

# The 'embedding' column is stored as a stringified list; convert it back to a NumPy array
df['embedding'] = df['embedding'].apply(eval).apply(np.array)

# Compare the input embedding with the embedding of each fact
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))

df
The code above will take the input and compare it with the first fact. It will save the calculated similarity of the two in a new column named 'similarity'. Then it will take the second fact, the third fact, the fourth fact, etc.
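Once the 'similarity' column is filled in, you typically want the most similar fact(s). Here is a minimal sketch of how you might rank them, assuming companies_embeddings.csv also has a 'text' column holding the fact itself (the column name is an assumption, adjust it to your CSV):

# Sort the facts by similarity to the input, highest first
# NOTE: assumes the CSV has a 'text' column with the fact text
top_facts = df.sort_values('similarity', ascending=False).head(3)
print(top_facts[['text', 'similarity']])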
Regarding which distance function to choose
As stated in the official OpenAI documentation:
Which distance function should I use?
We recommend cosine similarity.
The choice of distance function typically doesn’t matter much.
OpenAI embeddings are normalized to length 1, which means that:
- Cosine similarity can be computed slightly faster using just a dot product
- Cosine similarity and Euclidean distance will result in the identical rankings
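To make the two bullet points concrete, here is a small self-contained sketch (using NumPy and made-up random vectors, not real embeddings) showing that for unit-length vectors the dot product equals cosine similarity, and that ranking by Euclidean distance (ascending) matches ranking by cosine similarity (descending):

import numpy as np

rng = np.random.default_rng(0)

# Made-up "embeddings": random vectors normalized to length 1,
# mimicking the normalization of OpenAI embeddings
query = rng.normal(size=8)
query /= np.linalg.norm(query)
facts = rng.normal(size=(5, 8))
facts /= np.linalg.norm(facts, axis=1, keepdims=True)

# For unit vectors, cosine similarity reduces to a dot product
cos_sim = facts @ query

# Euclidean distance to the query
eucl = np.linalg.norm(facts - query, axis=1)

# Highest similarity first vs. smallest distance first
print(np.argsort(-cos_sim))
print(np.argsort(eucl))  # prints the same index order

This works because for unit vectors ||a - b||^2 = 2 - 2(a . b), so sorting by ascending distance is the same as sorting by descending cosine similarity.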