1

I have a large table with a list of company names and need a way of unifying the company names, e.g.

  • McDonalds Restaurant = McDonalds
  • McDonalds Fast Food = McDonalds
  • McDonalds Food 1234 = McDonalds
  • McDonald = McDonalds
  • McDnld = McDonalds
  • McDonalds Farm doesn't equal McDonalds
  • Microsoft -> Microsoft
  • Bare Essentials -> Bare Escentuals
  • Polycom, Inc. -> Polycom

Is there away to do this with out writing out each rule individually? Or at least generate some sort of percentage on the likely chance that One company name belongs to a certain company?

user1697891
  • 49
  • 1
  • 1
  • 8

3 Answers3

2

Try:

SELECT FROM `company` WHERE `name` LIKE
    "%McDonalds%Food%" or "%McDonalds%Restaurant%"

You'll need to handle each case individually since you're explicitly excluding %Farm from the resultset.

Daniel Li
  • 14,976
  • 6
  • 43
  • 60
  • If McDonalds was the only company, but I need to group all the companies for every variation. I was hoping for a way where I did not have to write a rule for all the variations for each company. Also, I would like it to take into a count abbreviations and misspellings. – user1697891 Sep 25 '12 at 16:27
2

If your doesn't equal would be much shorter you could do a NOT LIKE rule for each one of those. Otherwise there isn't really a way that SQL could tell one from the other. What I would do is to make a global company table that would hold the base name and tie to the child table with a base store ID.

Jonah Kunz
  • 670
  • 8
  • 19
1

The short answer is...no, at least not in SQL.

This sort of heuristic matching of names has been the subject of a lot of research.

Many SQL implementations have a Soundex function, but that works well (for some definition of "well") only for conventional Anglo-Saxon names (that were widely used a century ago). See http://www.immagic.com/eLibrary/ARCHIVES/GENERAL/LAS_US/L030206B.pdf for some issues with Soundex.

Community
  • 1
  • 1
Nicholas Carey
  • 71,308
  • 16
  • 93
  • 135
  • Thanks, this is definitely on the right track. I don't mind doing the computations in a different language. But how do things like Levenshtein handle stuff like MSFT=Microsoft or would another thing be needed for determining likenesses of a match for abbreviations? – user1697891 Sep 25 '12 at 18:30
  • I've got to imagine that mapping acronyms to abbreviations is going to require some sort of lookup. How could any algorithm figure out without contextual and cultural knowledge that `HOG` is the ticker symbol for Harley-Davidson and `KO` the ticker symbol for Coca-Cola? – Nicholas Carey Sep 25 '12 at 19:22