How to design a high performance Key Matching algorithm for a Translation Memory/Cache?

Question

Recently I've been assigned to build a translation memory (TM) for a new project. The idea is that the TM is a cache layer on top of the RPC layer, which calls the Google Translate API only if there is no match in the TM. I'm considering using the source text as the key in the TM, so I need a fuzzy matching algorithm to match a query text against the keys in the TM. If the match score is higher than some threshold like 0.85 (the range is 0 to 1), the cached translation will be used instead of calling the Google service.
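To make the setup concrete, here is a minimal sketch of the cache layer described above. It assumes a linear scan over the keys and uses `difflib.SequenceMatcher` from the Python standard library as a placeholder similarity measure; `TranslationMemory` and `translate_fn` are hypothetical names, and `translate_fn` stands in for the RPC call to the translation service.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Hypothetical fuzzy cache in front of a translation backend."""

    def __init__(self, translate_fn, threshold=0.85):
        self.translate_fn = translate_fn  # stand-in for the RPC/Google call
        self.threshold = threshold        # score range is 0 to 1
        self.entries = {}                 # source text -> translated text

    def lookup(self, text):
        """Return (best_key, best_score) using a linear scan over all keys."""
        best_key, best_score = None, 0.0
        for key in self.entries:
            score = SequenceMatcher(None, text, key).ratio()
            if score > best_score:
                best_key, best_score = key, score
        return best_key, best_score

    def translate(self, text):
        key, score = self.lookup(text)
        if key is not None and score > self.threshold:
            return self.entries[key]          # fuzzy hit: reuse the cached translation
        translation = self.translate_fn(text)  # miss: call the backend
        self.entries[text] = translation
        return translation
```

The linear scan costs one comparison per cached key on every query, so a real system would probably need an index to narrow down candidates first. The scoring function is the part the question is actually about.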

I've read a lot of articles/blogs/papers, but still don't know where to start. TF-IDF + cosine similarity doesn't seem good enough? Levenshtein distance? What about semantic similarity? But how?
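For reference, Levenshtein distance can be turned into a score in the 0 to 1 range by dividing by the length of the longer string. The following is a plain dynamic-programming sketch; the function names are illustrative and the normalization is just one common choice, not the only one.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the shorter row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize the distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

This runs in time proportional to the product of the two string lengths, so it is only practical against a short list of candidate keys.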

I read about this. In the comments, @mbatchkarov seems to point in the right direction.

Does anyone have similar experience with this subject? Any suggestions are welcome.

I've tried Lucene, but the best method I've heard of is http://www.wordfast.com/products_vltm.html. I can't tell you much since it's `swore to secrecy`, but if you knew how it works, it's pretty much magical. — alvas, Feb 28 '14 at 08:49

score 1 · Accepted Answer · answered Feb 21 '14 at 11:40

A lot of the time, the accepted answer to the question you linked to can get you quite far. You can compare the word (lemma) overlap between a query and all queries in the cache. To improve performance, you can incorporate word similarity to help you link semantically similar words. The thesaurus-building software I linked to in my comment is BSD-licensed, so you are free to use it as you see fit. If you need any help using it, the developers (disclaimer: I am a part of the team) will be happy to help out. In fact, I've got a few pre-built thesauri lying around. These should probably be a part of the software, but they are too large to upload to GitHub.
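A rough sketch of the word-overlap idea, under some loud assumptions: lowercased tokens stand in for lemmas (a real system would use a lemmatizer), overlap is measured as Jaccard similarity, and the `synonyms` dictionary is a hypothetical stand-in for a thesaurus lookup that maps a word to a canonical form. None of these names come from the thesaurus software mentioned above.

```python
import re


def words(text):
    """Lowercased word set; a crude stand-in for proper lemmatization."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap(query, cached, synonyms=None):
    """Jaccard overlap of word sets, after mapping synonyms to one canonical word."""
    synonyms = synonyms or {}  # hypothetical thesaurus: word -> canonical word
    q = {synonyms.get(w, w) for w in words(query)}
    c = {synonyms.get(w, w) for w in words(cached)}
    if not q and not c:
        return 1.0
    return len(q & c) / len(q | c)


def best_match(query, cache_keys, synonyms=None):
    """Return (score, key) for the cached key with the highest overlap."""
    return max(((overlap(query, k, synonyms), k) for k in cache_keys),
               default=(0.0, None))
```

For example, `overlap("the big cat", "the large cat")` is 0.5 on raw tokens, and rises to 1.0 if the synonym table maps "large" to "big".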

Whichever approach you go for, be aware that there will be many cases where this does not work well. This is because the approaches discussed in that question are about semantic similarity, and your application may require semantic equivalence. For example, "I like big ginger cats" and "We like big ginger cats" or "We like small ginger cats" are very similar in meaning, but it would be wrong to use the translation of one as a translation of the other.
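To illustrate the problem, the snippet below scores the example sentences with a character-level measure (`difflib.SequenceMatcher` from the standard library, chosen here only for convenience). Under this measure the first pair lands above a 0.85 threshold even though the subject differs, so a cache using it would return the wrong translation.

```python
from difflib import SequenceMatcher

cached = "I like big ginger cats"
queries = ["We like big ginger cats", "We like small ginger cats"]

# Character-level similarity between the cached key and each query.
scores = {q: SequenceMatcher(None, cached, q).ratio() for q in queries}

for query, score in scores.items():
    verdict = "would reuse cached translation" if score > 0.85 else "would call the API"
    print(f"{score:.2f}  {query!r}: {verdict}")
```

A higher threshold reduces such false hits but also reduces the hit rate, which is the trade-off between similarity and equivalence described above.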

Thanks. You're absolutely right. What I want is semantic equivalence. Any suggestions on that? — Nemesis, Feb 28 '14 at 04:07
At first I wrote a long comment with pointers on how you might go about implementing a translation memory yourself. However, this is going to take ages and may not work very well. The point of a TM is that human-provided translations are expensive, so you want to save on that. Google Translate is cheap. From a business point of view you are better off 1) not using a TM at all and just paying Google to translate everything on the fly, 2) buying a commercial TM system and integrating it into your app, or 3) using a free TM. It shouldn't be too hard to estimate the cost of each of these options. — mbatchkarov, Feb 28 '14 at 09:55
