
I am writing an algorithm that scans over a code base and uses the "text-embedding-ada-002" model from OpenAI to turn each code line into a vector. I then use the same model to embed a natural-language query, e.g. "Where is the search functionality housed?". Finally, I run a linear search comparing the cosine similarity between the query vector and each code-line vector, and take the top-k results.
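For reference, here is a minimal sketch of the search step described above. The embedding call is shown separately and is an assumption about how the vectors are fetched (the `openai.Embedding.create` call matches the library's API at the time of writing); the `cosine_top_k` function is the linear search itself, operating on pre-computed vectors:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Fetch an ada-002 embedding. Hypothetical helper; requires an API key."""
    import openai  # assumed installed and configured
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def cosine_top_k(query_vec: np.ndarray, line_vecs: np.ndarray, k: int = 5):
    """Linear search: return indices and cosine similarities of the k best lines.

    ada-002 vectors are already unit-length, but we normalize defensively
    so the dot product equals cosine similarity for any input.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = line_vecs / np.linalg.norm(line_vecs, axis=1, keepdims=True)
    sims = m @ q                      # one dot product per code line
    idx = np.argsort(-sims)[:k]      # indices of the k highest similarities
    return idx, sims[idx]
```

After embedding every line into a `(n_lines, 1536)` matrix, a query is answered with a single call to `cosine_top_k`.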

The issue:

I notice that seemingly all code lines have high cosine similarity scores (greater than 0.6). I tried extending the embedding window from 1 line to 10 lines at a time, but the scores all still seem high. The search also biases heavily towards natural-language documents like the README. Any ideas on how I can improve this? Maybe some low-hanging fruit that I'm missing?

  • Please provide examples of the code that you use. – sophros Mar 09 '23 at 22:04
  • IMHO I wouldn't expect any good result from representing code lines with text embeddings; it's like using the wrong language representation. Also, cosine values should not be interpreted as absolute values; they should always be compared relative to a reference. – Erwan Mar 10 '23 at 11:03

0 Answers