I am writing an algorithm that scans over a code base and uses OpenAI's "text-embedding-ada-002" model to turn each code line into a vector. I then use the same model to embed a natural language query, e.g. "Where is the search functionality housed?". Finally, I run a linear search, comparing the cosine similarity between the query vector and each code-line vector, and take the top-k results.
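For context, the search step looks roughly like this (the actual embedding calls are omitted; vectors are just numpy arrays here, and `k=5` is an arbitrary choice for illustration):

```python
import numpy as np

def top_k(query_vec, line_vecs, k=5):
    """Linear cosine-similarity search over per-line embedding vectors."""
    # ada-002 vectors are already unit-length, but normalize defensively
    # so this also works with arbitrary vectors.
    q = query_vec / np.linalg.norm(query_vec)
    m = line_vecs / np.linalg.norm(line_vecs, axis=1, keepdims=True)
    sims = m @ q                   # one dot product per code line
    idx = np.argsort(-sims)[:k]   # indices of the k most similar lines
    return idx, sims[idx]
```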
The issue:
I notice that seemingly all code lines get high cosine similarity scores (greater than 0.6). I tried widening the embedding window from 1 line to 10 lines at a time, but the scores all still seem high. The search also biases heavily towards natural-language documents like the README. Any ideas on how I can improve this? Maybe some low-hanging fruit that I'm missing?
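For reference, the 10-line windowing I tried is roughly this (the `stride` parameter is just an illustration; setting it below `size` would give overlapping windows, which I haven't tried yet):

```python
def window_chunks(lines, size=10, stride=10):
    """Group consecutive source lines into fixed-size chunks for embedding."""
    chunks = []
    for start in range(0, len(lines), stride):
        chunk = lines[start:start + size]
        if chunk:
            # Each chunk is embedded as a single string.
            chunks.append("\n".join(chunk))
    return chunks
```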