I'm building an AI assistant that answers from a custom set of Q&A pairs stored in a vector database.
Every example I have found presents this as a very simple task: chunk the documents (Q&A pairs in this case), create embeddings, store them in a vector DB, and then query it at search time.
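To make that pipeline concrete, here is a minimal sketch of the four steps. Everything in it is a stand-in for illustration: `toy_embed` is a hypothetical placeholder (hashed character trigrams, not a real embedding model and not the OpenAI API), and `InMemoryVectorStore` is a hypothetical plain Python list standing in for the vector DB. Each Q&A pair is treated as one chunk.

```python
import hashlib
import math
from typing import Callable, List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Placeholder embedding (assumption): hashed character trigrams, L2-normalized.

    This captures spelling overlap only, not meaning. In the real pipeline
    this function would be replaced by a call to the embedding model.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        digest = hashlib.md5(trigram.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector DB: a list of (vector, question, answer)."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self.embed = embed
        self.rows: List[Tuple[Vector, str, str]] = []

    def add(self, question: str, answer: str) -> None:
        # Steps 1-3: one Q&A pair = one chunk; embed it and store it.
        chunk = f"{question}\n{answer}"
        self.rows.append((self.embed(chunk), question, answer))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str, str]]:
        # Step 4: embed the query and rank stored chunks by dot product.
        # Vectors are unit length, so the dot product equals cosine similarity.
        q = self.embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), question, answer)
            for vec, question, answer in self.rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore(toy_embed)
store.add("¿Qué es el mar?", "El mar es una gran masa de agua salada.")
store.add("¿Cuál es la capital de España?", "Madrid.")
results = store.search("¿Qué es el mar?\nEl mar es una gran masa de agua salada.", k=2)
```

The only part that decides the quality of the search is the embedding function, which is why my question is about that step rather than about the storage or the query code.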
However, the OpenAI embeddings are not giving me good results for Q&A in Spanish, specifically for semantic search. For example, if I have a Q&A pair that talks about "mar" ("sea" in English) and I then query for "Ocean," the query embedding should be close to the "mar" embeddings, but that is not the case.
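By "close" I mean a high similarity score between the query vector and the stored vector. A minimal sketch of how that can be measured, assuming cosine similarity is the metric (a common choice, though a given vector DB may be configured differently). The function names are hypothetical and the vectors at the bottom are made-up toy values, not real model output.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[float, str]]:
    """Return (score, label) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in candidates.items()]
    return sorted(scored, reverse=True)


# Toy 2-dimensional vectors, invented purely to exercise the functions.
toy_candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_ranking = rank_by_similarity([1.0, 0.0], toy_candidates)
```

With real embeddings, I would pass the query vector for "Ocean" and the stored vectors for each Q&A pair, and I would expect the "mar" pair to rank near the top.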
What is the workflow to create good embeddings for Spanish? Do I have to preprocess the Q&A text before creating the embeddings? Is there a better model than OpenAI's for this? I have searched a lot, but all the tutorials are for English. I think an answer for Spanish could apply to other languages too.