
I have an OpenAI embedding generated from their API.

I see examples of putting that vector into Postgres or Sqlite and then running a query against it.

I'm looking for simple Python code where I can take a text string, embed it, and see how close it is (by cosine distance) to my stored embedding. I believe cosine distance is used in databases because it is simpler to calculate: would using Euclidean distance be a more accurate estimate of the "closeness" of the strings? If there is a better distance function to use, I'm interested in seeing that as well.


1 Answer


Regarding cosine similarity calculation

See my past answer, especially the following part (i.e., STEP 3):

We need to calculate an embedding vector for the input so that we can compare the input with a given "fact" and see how similar these two texts are. Actually, we compare the embedding vector of the input with the embedding vector of the "fact". Then we compare the input with the second "fact", the third "fact", the fourth "fact", etc. Run get_cosine_similarity.ipynb.

get_cosine_similarity.ipynb

import openai
from openai.embeddings_utils import cosine_similarity
import numpy as np
import pandas as pd
import os

openai.api_key = os.getenv('OPENAI_API_KEY')

my_model = 'text-embedding-ada-002'
my_input = '<INSERT_INPUT_HERE>'

def get_embedding(model: str, text: str) -> list[float]:
    # Use the function's parameters, not the module-level globals
    result = openai.Embedding.create(
      model = model,
      input = text
    )
    return result['data'][0]['embedding']

input_embedding_vector = get_embedding(my_model, my_input)

df = pd.read_csv('companies_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, input_embedding_vector))
df

The code above will take the input and compare it with the first fact. It will save the calculated similarity of the two in a new column named 'similarity'. Then it will take the second fact, the third fact, the fourth fact, etc.
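If you want to see what that comparison step does without calling the API, here is a minimal sketch using only NumPy and pandas. The `fact` texts and the tiny 3-dimensional vectors are made-up stand-ins for real 1536-dimensional embedding vectors, and `cosine_similarity` is written out by hand instead of imported:

```python
import numpy as np
import pandas as pd

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for real embedding vectors
input_embedding_vector = [0.1, 0.2, 0.7]
df = pd.DataFrame({
    'fact': ['fact A', 'fact B', 'fact C'],
    'embedding': [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1], [0.0, 1.0, 0.0]],
})

# Same pattern as above: one similarity score per row
df['similarity'] = df['embedding'].apply(
    lambda x: cosine_similarity(x, input_embedding_vector)
)
print(df.sort_values('similarity', ascending=False))
```

Here `fact A` ranks first because its vector is identical to the input's, so its cosine similarity is exactly 1.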

Regarding which distance function to choose

As stated in the official OpenAI documentation:

Which distance function should I use?

We recommend cosine similarity. The choice of distance function typically doesn’t matter much.

OpenAI embeddings are normalized to length 1, which means that:

  • Cosine similarity can be computed slightly faster using just a dot product
  • Cosine similarity and Euclidean distance will result in the identical rankings
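The second point follows from the identity `|u - v|^2 = 2 - 2 * cos(u, v)` for unit vectors: Euclidean distance is a monotonically decreasing function of cosine similarity, so sorting by either gives the same order. A small sketch with hand-picked unit vectors (not real embeddings) to illustrate:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

# Three unit vectors; for unit vectors, cosine similarity is just the dot product
a = np.array([0.6, 0.8])
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])

for other in (b, c):
    cos_sim = cosine(a, other)
    dist = euclidean(a, other)
    # Identity check: |a - other|^2 == 2 - 2 * cos(a, other)
    print(cos_sim, dist, dist**2, 2 - 2 * cos_sim)
```

Higher cosine similarity always pairs with smaller Euclidean distance here, which is why the choice between the two doesn't change which "facts" rank closest.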