0

I have a list of colleges (School model) in a database and I have user input that should decide which School to link User to.

The problem is, humans are fallible. So instead of University of Miami, they could type Universiti of Miami or Boston-College instead of Boston College.

I need to be able to find these schools despite these mistakes and at least provide the user with a list of similar school names if a definite match does not exist. I do not want to use something like Sphinx or any full-text standalone search engine because this search only happens at registration and the strings are small.

Any thoughts as to solutions?

Thanks in advance guys.

Pat
  • 5,761
  • 5
  • 34
  • 50
hmind
  • 840
  • 1
  • 6
  • 15
  • One possible solution is to utilize jQuery's AutoComplete. – bricker Oct 18 '11 at 02:22
  • 2
    @hmind: your title says that your after an algorithm... An easy one to write would be: for every school, you compute it's *"Levenhstein Edit Distance"* with what the user entered. If your school database is correct, then the ones with the lowest edit distance will be the most likely matches. http://en.wikipedia.org/wiki/Levenshtein_distance You can get *very* fancy when doing such things but simply computing all the edit distances and choosing amongst the lowest ones should give good results. The algorithm itself is only a few lines of code (and there's a nice "dynamic programming" version) – TacticalCoder Oct 18 '11 at 02:24

3 Answers3

3

I use the Soundex hash function as implemented by MySQL. Docs. Works well when giving the user a drop down menu of possible matches and a "create new" action.

Larry K
  • 47,808
  • 15
  • 87
  • 140
  • Nice, never knew that mysql had built in soundex. It appears that postgres also supports both a soundex and difference function along with some other text algorithms. http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html – trogdor33 Oct 18 '11 at 04:44
1

You could check out the text gem, although I don't think it'd help for things like "Boston-College"/"Boston College". The range of those types of errors is pretty large; I'm not sure what the best way to deal with that would be.

Dave Newton
  • 158,873
  • 26
  • 254
  • 302
0

The one I use is called strikematch based on an answer here although it may be better suited for variable length strings.

#Returns between 0 and 1 based on how close two strings are
def strikematch(str1, str2)
  str1.downcase! 
  pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
    |pair| pair.include? " "}
  str2.downcase! 
  pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
    |pair| pair.include? " "}
  union = pairs1.size + pairs2.size 
  intersection = 0 
  pairs1.each do |p1| 
    0.upto(pairs2.size-1) do |i| 
      if p1 == pairs2[i] 
        intersection += 1 
        pairs2.slice!(i) 
        break 
      end 
    end 
  end 
  (2.0 * intersection) / union
end
Community
  • 1
  • 1
trogdor33
  • 175
  • 1
  • 8