
I am writing an algorithm that scans over a code base and uses the "text-embedding-ada-002" model from OpenAI to turn each code line into a vector. I then use the same model to embed a natural-language query, e.g. "Where is the search functionality housed?". Finally, I run a linear search comparing the cosine similarity between the query vector and each code-line vector, and take the top-k results.
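For reference, here is a minimal sketch of the search step described above. The embedding call is shown separately and is an assumption about how the vectors are fetched (the `openai.Embedding.create` call matches the library's API at the time of writing); the `cosine_top_k` function is the linear search itself, operating on pre-computed vectors:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Fetch an ada-002 embedding. Hypothetical helper; requires an API key."""
    import openai  # assumed installed and configured
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def cosine_top_k(query_vec: np.ndarray, line_vecs: np.ndarray, k: int = 5):
    """Linear search: return indices and cosine similarities of the k best lines.

    ada-002 vectors are already unit-length, but we normalize defensively
    so the dot product equals cosine similarity for any input.
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = line_vecs / np.linalg.norm(line_vecs, axis=1, keepdims=True)
    sims = m @ q                      # one dot product per code line
    idx = np.argsort(-sims)[:k]      # indices of the k highest similarities
    return idx, sims[idx]
```

After embedding every line into a `(n_lines, 1536)` matrix, a query is answered with a single call to `cosine_top_k`.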

The issue:

I notice that seemingly all code lines have high cosine similarity scores (greater than 0.6). I tried extending the embedding window from 1 line to 10 lines at a time, but the scores all still seem high. The search also biases heavily towards natural-language documents like the README. Any ideas on how I can improve this? Maybe some low-hanging fruit that I'm missing?

  • Please provide examples of the code that you use. – sophros Mar 09 '23 at 22:04
  • IMHO I wouldn't expect any good result from representing code lines with text embeddings; it's like using the wrong language representation. Also, cosine values should not be interpreted as absolute values; they should always be compared relative to a reference. – Erwan Mar 10 '23 at 11:03

0 Answers