Recently I've been assigned to build a translation memory (TM) for a new project. The idea is that the TM acts as a cache layer on top of the RPC layer, which calls the Google Translate API when there is no match in the TM. I'm considering using the source text as the key in the TM, so I need a fuzzy matching algorithm to match a query text against the keys. If the best similarity score is higher than some threshold, say 0.85 (on a scale of 0 to 1), the cached translation is used instead of calling the Google service.
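To make the setup concrete, here is a minimal sketch of the lookup I have in mind. Everything in it is an assumption for illustration: `TranslationMemory` and `translate_remote` are hypothetical names (the latter stands in for the RPC call to Google Translate), the linear scan over keys would not scale to a large TM, and `difflib.SequenceMatcher` is only a placeholder for whatever similarity measure turns out to be appropriate.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; swap in a better measure later."""
    return SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()


class TranslationMemory:
    """Hypothetical cache in front of a remote translation call."""

    def __init__(self, translate_remote: Callable[[str], str],
                 threshold: float = 0.85) -> None:
        self.translate_remote = translate_remote  # assumed RPC wrapper
        self.threshold = threshold
        self._entries: Dict[str, str] = {}  # source text -> translation

    def translate(self, text: str) -> str:
        # Linear scan for the most similar stored source text.
        best_key, best_score = None, 0.0
        for key in self._entries:
            score = similarity(text, key)
            if score > best_score:
                best_key, best_score = key, score

        # Cache hit: the best match is at or above the threshold.
        if best_key is not None and best_score >= self.threshold:
            return self._entries[best_key]

        # Cache miss: call the remote service and remember the result.
        translated = self.translate_remote(text)
        self._entries[text] = translated
        return translated
```

With this shape, the only piece that really needs deciding is what goes inside `similarity`.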
I've read a lot of articles, blogs, and papers, but I still don't know where to start. TF-IDF + cosine similarity doesn't seem good enough? Levenshtein distance? What about semantic similarity? And if so, how would I compute it?
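For reference, this is roughly what I mean by a Levenshtein-based score: the edit distance normalized by the length of the longer string, so the result falls in the 0 to 1 range the threshold expects. It is only a sketch in plain Python, and the normalization is one common choice that I am assuming here, not the only one. Being character-based, it cannot see that two differently worded sentences mean the same thing, which is why I'm asking about semantic similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```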
I read about this. In the comments, @mbatchkarov seems to point in the right direction.
Does anyone have similar experience with this? Any suggestions are welcome.