Recognizing similarity in strings

Question

I'm working on a system which allows imported files to be localized into other languages.

This is mostly a private project to get the hang of MVC3, EntityFramework, LINQ, etcetera. Therefore I like doing some crazy things to spice up the end result, one of those things would be the recognition of similar strings.

Imagine you have the following list of strings - borrowed from a game I've worked with in the past:

Megabeth: Holy Roller Uniform - Includes Head, Torso, and Legs
Megabeth: Holy Roller Uniform Head
Megabeth: Holy Roller Uniform Legs
Megabeth: Holy Roller Uniform Torso
Megabeth: PAX East 2012 Uniform - Includes Head, Torso, and Legs
Megabeth: PAX East 2012 Uniform Head
Megabeth: PAX East 2012 Uniform Legs
Megabeth: PAX East 2012 Uniform Torso

As you can see, once users have translated the first 4 strings, the following 4 share a lot of similarities, in this case:

Megabeth
Uniform
Includes Head, Torso, and Legs
Head
Legs
Torso

Consider the first 4 strings are indeed already translated, when a user selects the 5th string from the list, what kind of algorithm or technique can I use to show the user the 1st string (and potentially others) under a sub-header of "Similar strings"?

Edit - A little comment on the Levenshtein Distance: I'm currently targeting 10k strings in the database. Levenshtein Distance compares string per string, so in this case 10k x (10k -1) possible combinations. How would I approach this in a feasible way? Is there a better solution that this particular algorithm?

Interesting question. I don't know where to begin to answer this but ill hang out and watch. — Grahame A, Oct 22 '12 at 20:19
Edit distance. which has many varieties. and fairly straight forward. can be computationally expensive if your matrix grow large. — DarthVader, Oct 22 '12 at 20:22
You could concat all the strings, then split by white space (using regex) then linq it with `.Distint()` and perform a translation with replace. The problem with this is, not all languages translate word for word. — Jay, Oct 22 '12 at 20:22
@Jay That's okay, it's supposed to aid the user in his translation process, not do all the work for him... yet at least :p — Lennard Fonteijn, Oct 22 '12 at 20:26

score 5 · Accepted Answer · answered Oct 22 '12 at 20:20

5

You could look into the Levenshtein Distance. Those below a certain threshold will be considered similar. Two strings that are identical will have a distance of zero.

There's a C# implementation, amongst other languages, on Rosetta Code.

answered Oct 22 '12 at 20:20

keyboardP

68,824
13
156
205

+1, Was just going to recommend Levenshtein, you beat me to it – CaffGeek Oct 22 '12 at 20:22
I've come across that algorithm indeed but frankly forgot the name, thanks. I'm curious to more answers, so I'm leaving this open for a bit ;) – Lennard Fonteijn Oct 22 '12 at 20:23
That's fine, I'm also interested to see if someone else has another solution :) – keyboardP Oct 22 '12 at 20:24
A little more information: I'm currently targeting 10k strings in the database. Levenshtein Distance compares string per string, so in this case 10k x (10k -1) possible combinations. How would I approach this in a feasible way? – Lennard Fonteijn Oct 22 '12 at 20:34
@LennardFonteijn what database? – MK. Oct 22 '12 at 20:37
@LennardFonteijn I'm not sure of benchmarks, so you could try a test run with a few thousand to measure performance. Alternatively, `Levenshtein Automaton` may be a starting point, although I haven't had any experience with it. http://en.wikipedia.org/wiki/Levenshtein_automata – keyboardP Oct 22 '12 at 20:42
It's a Microsoft SQL server. By feasible I mean how to approach this, do I need a separate table for the matches, do I calculate the distance on the fly, etcetera. – Lennard Fonteijn Oct 22 '12 at 21:44
@LennardFonteijn - You can perform it in T-SQL http://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql When the user selects a string, you compare that one string with the all the other strings in the database. Then return only the values of those that are below a certain threshold. – keyboardP Oct 22 '12 at 22:10
@keyboardP How does that perform when hundreds of concurrent users are translating strings? Seems pretty heavy on the database to perform that operation every single web-request (don't mind caching for now). – Lennard Fonteijn Oct 22 '12 at 22:40
@LennardFonteijn I'm not an expert on SQL, so someone else may have to help out there. However, it might be possible to cache all values below the threshold every hour or so (depending how often it updates). Any new strings can then be compared to the cache rather than the database itself. – keyboardP Oct 22 '12 at 22:44
Have a look at [amaGama](https://github.com/translate/amagama) - it does what you are wanting. The online server hosts all FOSS localisation, considerably greater then 10,000 strings and performs fine. Its written in Python and uses PosgreSQL so might not be exactly what you want. – Dwayne Oct 25 '12 at 08:23

MK. · Answer 2 · 2012-10-22T20:39:08.897

This will depend on the size of the data and how rich the vocabulary is. Here's the first thought: build a map of words to strings then another map of word pairs to strings and perhaps if data is not huge map of string triplets to strings. Remove mappings that point at a single string (this will dramatically reduce the number of triplet mappings). Save resulting dictionary on disk or in a database if building it takes time.

Now given a string you should be able to quickly split it into words, word pairs and triplets and look up all strings related to it. You will need to play with giving weight to a triplet matching vs 4 words matching. I.e. is "I am an old man" closer to "an old man ate a carrot" or "man killed the old dog with an arrow" (sounds like triplet match is more important).

UPDATE: If this in a Microsoft SQL Server database you can play with full text search feature. I never tried it though. You should also take a look at Lucene.

Recognizing similarity in strings

2 Answers2