0

I want to perform permissive/lenient string comparison in JavaScript like this:

Morocco = Moroco = Moroko = Morokko = Marocco = Maroco
Russia = Rusia
US = USA
Bucharest = Buharest
Afghanistan = Afganistan
Bangkok = Bankok
etc..

These comparisons will be used when operating with third-party APIs. I won't make any choices in my application based on them, but my goal is to provide the best options to the user. And the user will decide what's OK for him.

Would you point me the right way? The only idea comes to my mind is use character checksums and compare them. Maybe there could be better approach?

It would be nice also to get a "match integer" like:

var n = compare("Morocco", "Marocco"); // n = 95
var m = compare("Morocco", "Marokko"); // n = 85

but how to do that?

Thanks.

Roman Pushkin
  • 5,639
  • 3
  • 40
  • 58
  • 1
    I wish I could tell you to just use Levenshtein distance but that's not how it works. I think your easiest/best bet is to just keep a list of common synonyms for each term you support. – Benjamin Gruenbaum Dec 16 '13 at 16:50
  • you can search Wikipedia suggestions for both and compare the search results... – dandavis Dec 16 '13 at 16:56
  • 1
    @dandavis that's actually quite an interesting idea! I'm sure you can use other services who are already _aware_ of the different (or will auto-correct it) like Google might be an interesting approach too. – Benjamin Gruenbaum Dec 16 '13 at 16:58
  • yeah, i was thinking google, but they might have strict usage or login requirements, and i know wikipedia runs a jsonp suggestion service: http://en.wikipedia.org/w/api.php?action=opensearch&search=usa&namespace=0 – dandavis Dec 16 '13 at 17:05

3 Answers3

2

I doubt you'll get something with checksums.

If you don't want a library but just a simple algorithm, you might compute the Levenshtein distance. It's probably the best solution among the simple no-dictionary ones.

If you want something morepowerful, I'd suggest you to start studying approximate string matching and search relevant libraries.

Denys Séguret
  • 372,613
  • 87
  • 782
  • 758
  • Doubt this is the best solution, because America == USA (which was not mentioned, but completely understandable that it ought to be) – Brian Dec 16 '13 at 16:52
  • @Brian This isn't, of course, the "best" solution. But maybe the best among the cheap ones. Regarding "America == USA", it depends on OP goals, this would be terrible if the goal is to fix typing errors for example, and OP's examples look like he mainly has to deal with mispelling. – Denys Séguret Dec 16 '13 at 17:00
2

Your best bet for something like this is going to be using a spell check library. This library (http://www.javascriptspellcheck.com/) is one example that might work. Looking further at that particular API, you can read suggestions via AJAX as follows:

o = $Spelling.AjaxSpellCheckFields(Fields)
o.onValidate = function(result) { }

I'm sure there are other excellent libraries out there that should perform similar operations.

In terms of algorithms the basic idea is to compute the distance between what the user has entered and a list of words in a dictionary. I read suggestions that a "Bloom Filter" is a good option. For more information on that see "What algorithm gives suggestions in a spell checker?".

Overall, your algorithm needs to be able to handle the following inputs:

  • User entered characters - Obvious, but important
  • Past selections - basically over time certain mistakes will be common. Remembering the most commonly selected suggestions for any mistake or how the user corrects their own mistake can really improve the quality of your algorithm over time. This information could even be saved in a user specific manner
  • Context - If you know the user should be entering a country code your dictionary can shrink significantly which means you should be able to provide better suggestions

I think with a little more research this should put you on the right track. Best of luck!

Community
  • 1
  • 1
drew_w
  • 10,320
  • 4
  • 28
  • 49
  • Interesting! Although this wouldn't work for aliasing things like US and USA together. You need to understand the meaning and not just the spelling->correction mapping – Benjamin Gruenbaum Dec 16 '13 at 16:54
0

You should try to compare the values using a similarity algorithm like the Damerau–Levenshtein distance. Here's an implementation in javascript:

Sort an array by the "Levenshtein Distance" with best performance in Javascript

Community
  • 1
  • 1
Horen
  • 11,184
  • 11
  • 71
  • 113