I'm working on a system which allows imported files to be localized into other languages.
This is mostly a private project to get the hang of MVC3, Entity Framework, LINQ, etc., so I like doing some crazy things to spice up the end result. One of those things is recognizing similar strings.
Imagine you have the following list of strings, borrowed from a game I've worked with in the past:
- Megabeth: Holy Roller Uniform - Includes Head, Torso, and Legs
- Megabeth: Holy Roller Uniform Head
- Megabeth: Holy Roller Uniform Legs
- Megabeth: Holy Roller Uniform Torso
- Megabeth: PAX East 2012 Uniform - Includes Head, Torso, and Legs
- Megabeth: PAX East 2012 Uniform Head
- Megabeth: PAX East 2012 Uniform Legs
- Megabeth: PAX East 2012 Uniform Torso
As you can see, once users have translated the first 4 strings, the following 4 share a lot of similarities, in this case:
- Megabeth
- Uniform
- Includes Head, Torso, and Legs
- Head
- Legs
- Torso
Assume the first 4 strings are indeed already translated. When a user selects the 5th string from the list, what kind of algorithm or technique can I use to show the user the 1st string (and potentially others) under a sub-header of "Similar strings"?
Edit - A little comment on the Levenshtein distance: I'm currently targeting 10k strings in the database. Levenshtein distance compares one pair of strings at a time, so comparing every string against every other means 10k x (10k - 1) possible combinations. How would I approach this in a feasible way? Is there a better solution than this particular algorithm?
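
To make the concern concrete, here is a minimal sketch of the brute-force approach, written in Python for brevity (the actual project is C#). The names `levenshtein` and `similar_strings` are purely illustrative and do not come from any library; the ranking rule (smallest edit distance first) is an assumption about how "similar" would be scored.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similar_strings(selected, candidates, limit=3):
    """Hypothetical helper: rank candidates by edit distance to `selected`."""
    scored = [(levenshtein(selected, c), c) for c in candidates if c != selected]
    scored.sort()  # smallest distance first
    return [c for _, c in scored[:limit]]
```

A single lookup for one selected string runs `levenshtein` once per candidate, so roughly 10k calls at the target size. Doing that for every string up front is where the 10k x (10k - 1) figure comes from.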