There are quite a few tutorials on embeddings in OpenAI, but I can't understand how they work. According to https://platform.openai.com/docs/guides/embeddings/what-are-embeddings , an embedding is a vector, i.e. a list of floating-point numbers. In the simplest terms, a string is passed to an embedding model and the model returns numbers, which I can then use. If I embed a simple string, I get back a massive list:
```python
result = get_embedding("I live in space", engine="textsearchcuriedoc001mc")
result
```

When printed:
```
[5.4967957112239674e-05,
 -0.01301578339189291,
 -0.002223075833171606,
 0.013594076968729496,
 -0.027540158480405807,
 0.008867159485816956,
 0.009403547272086143,
 -0.010987567715346813,
 0.01919262297451496,
 0.022209804505109787,
 -0.01397960539907217,
 -0.012806257233023643,
 -0.027908924967050552,
 0.013074451126158237,
 0.024942029267549515,
 0.0200139675289392,
 ...]
```

(truncated — the actual list is much, much longer)
Question 1 - how is this massive list correlated with my four-word text?
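My current understanding (an assumption on my part, not something the docs state in these terms) is that the length of that list is a property of the model, not of the input: every input, whether 4 words or 400, comes back as a vector of the same fixed dimension. A toy sketch with a made-up hash-based "embedding" (purely to illustrate the fixed-length idea, nothing like a real model):

```python
import hashlib

def toy_embedding(text: str, dim: int = 8) -> list[float]:
    """Hash-based stand-in for an embedding model (illustration only):
    whatever the input length, the output always has `dim` entries."""
    vec = []
    for i in range(dim):
        digest = hashlib.sha256(f"{i}:{text}".encode()).digest()
        # Map the first 8 bytes of the hash to a float in [-1, 1)
        n = int.from_bytes(digest[:8], "big")
        vec.append(n / 2**63 - 1.0)
    return vec

short = toy_embedding("I live in space")
long_ = toy_embedding("I live in space " * 50)
print(len(short), len(long_))  # both 8: the dimension is fixed by the "model"
```

A real embedding model does something far more meaningful than hashing, of course; the point is only that the output dimensionality is fixed.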
Question 2 - I create an embedding of the text I want to use as a query. Note that it is exactly the same as the original content, "I live in space":

```python
queryembedding = get_embedding(
    'I live in space',
    engine="textsearchcuriequery001mc"
)
queryembedding
```
When I run cosine similarity:

```python
similarity = cosine_similarity(embeddings_of_i_live, queryembedding)
similarity
```

I get the value 0.42056650555103214. Shouldn't the value be 1, indicating that the two texts are identical?
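To check my understanding of what `cosine_similarity` computes, here is a minimal plain-NumPy version (my own helper, not the library function, and the three-element vectors are made-up toy data): a vector compared with itself should give 1.0, so getting ~0.42 suggests the two embeddings really are different vectors — presumably because the doc engine and the query engine are different models.

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine similarity: dot product of the two vectors
    divided by the product of their Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors for illustration (not real embeddings)
v = [5.4967957112239674e-05, -0.01301578339189291, -0.002223075833171606]
w = [0.013594076968729496, -0.027540158480405807, 0.008867159485816956]

print(cosine_sim(v, v))  # ~1.0 (up to float rounding): identical vectors
print(cosine_sim(v, w))  # < 1: different vectors
```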