Embeddings and semantic search in Spanish

Question

I'm building an AI assistant that interacts with custom Q&A stored in a vector database.

Every example I find presents this as a very simple task: chunk the documents (Q&A pairs in this case), create embeddings, store them in a vector DB, and query it at search time.
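For context, the workflow those examples describe boils down to roughly the sketch below. It is only an illustration under stated assumptions: `toy_embed`, `InMemoryVectorStore`, and the sample Q&A pairs are hypothetical stand-ins, and the toy embedding (hashed character trigrams) captures spelling overlap, not meaning. A real setup would call an actual embedding model and a real vector DB in their place.

```python
import math
import zlib
from typing import Callable, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 256) -> Vector:
    """Placeholder embedding (NOT a real model): hashed character trigrams,
    L2-normalized. Swap in a real embedding call here."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB: a list scanned linearly."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.items: List[Tuple[Vector, str, str]] = []

    def add(self, question: str, answer: str) -> None:
        # Each Q&A pair is treated as one chunk and embedded as a whole.
        self.items.append((self.embed(f"{question} {answer}"), question, answer))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str, str]]:
        # Embed the query, score every stored chunk, return the top k.
        q = self.embed(query)
        scored = [(cosine(q, v), ques, ans) for v, ques, ans in self.items]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore(toy_embed)
store.add("¿Qué es el mar?", "Una gran masa de agua salada.")
store.add("¿Cómo cambio mi contraseña?", "Desde la página de ajustes.")
for score, question, answer in store.search("mar", k=2):
    print(f"{score:.3f}  {question}  ->  {answer}")
```

The quality of the results depends entirely on what `embed` returns, which is exactly the part that is not working well for me in Spanish.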

However, the OpenAI embeddings are not giving me good results for Q&A in Spanish, specifically for semantic search. For example, if I have a Q&A pair about "mar" ("sea" in English) and then I query for "Ocean," the query should land close to the "mar" embeddings, but that is not the case.

What is the workflow to create good embeddings for Spanish? Do I have to preprocess the Q&A text before creating the embeddings? Is there a better model than OpenAI's for this? I have searched a lot, but all the tutorials are for English. I think an answer for Spanish could apply to other languages too.

score 0 · Answer 1 · answered Aug 01 '23 at 10:43

I ran into the same issue. OpenAI embeddings are imperfect. For example, they're often good at logical similarity but not necessarily at semantic similarity: two antonyms may have a high cosine similarity because they belong to the same topic, when you'd expect them to be far apart because their meanings are opposite.

One way to solve this, although I haven't tried it personally, would be to follow OpenAI's cookbook on the topic. In a nutshell, you provide labeled training examples and the output is a matrix you can multiply your embeddings by. Hopefully the newly computed embeddings will then perform better on your specific task with your specific data.
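To make the idea concrete, here is a minimal sketch of the last step only: multiplying embeddings by an already-learned matrix and comparing cosine similarity before and after. Everything in it is made up for illustration. The 3-dimensional vectors are not real OpenAI embeddings, and the `learned` matrix is hand-picked rather than trained; in practice it would come out of training on your labeled examples.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def apply_matrix(embedding: Vector, matrix: Matrix) -> Vector:
    """Row vector times matrix: result[j] = sum_i embedding[i] * matrix[i][j]."""
    return [
        sum(e * row[j] for e, row in zip(embedding, matrix))
        for j in range(len(matrix[0]))
    ]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Made-up toy embeddings: the two words disagree on the first two
# dimensions and agree on the third.
mar = [0.9, 0.1, 0.4]
oceano = [0.1, 0.9, 0.4]

# Hypothetical "learned" matrix (hand-picked here, NOT trained): it shrinks
# the two dimensions where the words disagree and keeps the shared one.
learned = [
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]

before = cosine_similarity(mar, oceano)
after = cosine_similarity(apply_matrix(mar, learned), apply_matrix(oceano, learned))
print(f"before: {before:.3f}, after: {after:.3f}")
```

Note that the same matrix has to be applied to both the stored embeddings and the query embedding, otherwise the two no longer live in the same space.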

If you do try this approach, please let me know how it went!
