There are quite a few tutorials on OpenAI embeddings, but I can't understand how they work.

Referring to https://platform.openai.com/docs/guides/embeddings/what-are-embeddings, an embedding is a vector, i.e., a list of numbers. In the simplest terms, a string is passed to an embedding model and the model returns a list of numbers, which I can then use.

If I use a simple string to get its embedding, I get a massive list:

result = get_embedding("I live in space", engine="text-search-curie-doc-001")

result, when printed:

[5.4967957112239674e-05,
 -0.01301578339189291,
 -0.002223075833171606,
 0.013594076968729496,
 -0.027540158480405807,
 0.008867159485816956,
 0.009403547272086143,
 -0.010987567715346813,
 0.01919262297451496,
 0.022209804505109787,
 -0.01397960539907217,
 -0.012806257233023643,
 -0.027908924967050552,
 0.013074451126158237,
 0.024942029267549515,
 0.0200139675289392,
 ...]  # truncated; the actual list is much, much longer

Question 1: How is this massive list correlated with my 4-word text?

Question 2:

I create an embedding of the text I want to use as the query. Note that it is exactly the same text as the original content, I live in space:

queryembedding = get_embedding(
    "I live in space",
    engine="text-search-curie-query-001"
)
queryembedding

When I run cosine similarity:

similarity = cosine_similarity(result, queryembedding)
similarity

I get the value 0.42056650555103214.

Shouldn't the value be 1, indicating that the texts are identical?


1 Answer

Q1:

How is this massive list correlated with my 4-word text?

A1: Let's say you want to use the OpenAI text-embedding-ada-002 model. No matter what your input is, you will always get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space, which is very hard to imagine. Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002 model has an output dimension of 1536. It's pre-defined.
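For illustration, here is a minimal sketch using the pre-1.0 openai Python SDK (the same era as the question's get_embedding helper); it assumes a valid OPENAI_API_KEY is set in the environment, and the inputs are made up. The point is that both calls return a vector of the same fixed length:

import openai  # pre-1.0 SDK

short_resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="I live in space",
)
long_resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="A much longer passage about living aboard a space station. " * 10,
)

# The embedding length does not depend on the input length.
print(len(short_resp["data"][0]["embedding"]))  # 1536
print(len(long_resp["data"][0]["embedding"]))   # 1536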


Q2:

I create an embedding of the text I want to use as the query. Note that it is exactly the same text as the original content, I live in space. When I run cosine similarity, the value is 0.42056650555103214. Shouldn't the value be 1, indicating that the texts are identical?

A2: Yes, the value should be 1 if you calculate cosine similarity between two identical texts. See an example here.
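As a quick sanity check, here is a small self-contained sketch (plain numpy, no API calls) showing that the cosine similarity of any non-zero vector with itself is 1:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = [0.1, -0.2, 0.3, 0.4]       # stand-in for an embedding vector
print(cosine_similarity(v, v))  # 1.0

Note that in your snippet the two embeddings come from two different engines: text-search-curie-doc-001 for the document and text-search-curie-query-001 for the query. Different models return different vectors even for identical text, so the cosine similarity between them will not be 1.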

For an example of semantic search based on embeddings, see this answer.
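As a rough illustration of the idea, here is a minimal sketch using the same get_embedding and cosine_similarity helpers the question uses (from openai.embeddings_utils in the pre-1.0 SDK); the documents and query are made-up examples:

from openai.embeddings_utils import get_embedding, cosine_similarity

# Hypothetical corpus to search over.
documents = [
    "I live in space",
    "The cat sat on the mat",
    "Rockets launch from Florida",
]

# Embed all documents once with the doc model.
doc_embeddings = [
    get_embedding(doc, engine="text-search-curie-doc-001") for doc in documents
]

# Embed the query with the matching query model.
query_embedding = get_embedding(
    "Where do I live?", engine="text-search-curie-query-001"
)

# Rank documents by cosine similarity to the query.
ranked = sorted(
    zip(documents, (cosine_similarity(d, query_embedding) for d in doc_embeddings)),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")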

  • Strangely, I get a `cosine similarity` value of `0.42056650555103214`. I have updated the question. – Manu Chadha Apr 21 '23 at 05:16
  • If you compare two identical pieces of text and you don't get a cosine similarity of 1, then something is wrong with your process. Are your vectors the same? If they are, you are calculating cosine similarity incorrectly. – Ian Boddison May 14 '23 at 00:18