0

my mysql table has a cityname column. the values in the cityname column has some spelling mistakes. each of the spelling mistakes of city needs to be corrected. another table has a column that contains all correct city names. based on this table the spelling mistakes of the first table are needed to be corrected. i researched over stuff like soundex but couldn't find any link which has done something similar.

TableA 

+----+------------+----------+
| id | col1      | city_name   |
+----+------------+----------+


TableB

+----+------------+--
| index |City_name      |
+----+------------+--

The approach i am having in the mind is that getting a sql query that creates seperate table out of the rows which have similar sounding city names.

once this is done replacing wrong spellings with correct ones again using an sql query

and finally combining all different tables into one table with all corrected spellings.

i am looking for advice both in approach as well as mysql query syntax

silverkid
  • 9,291
  • 22
  • 66
  • 92
  • You may want to consider a Levenstein distance, rather than SoundEx http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance – spencer7593 Jun 22 '12 at 14:08
  • Look up Levenshtein distance. This is an algorithm developed in the 1960s for handling misspelllings. This posting might help you http://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql. – Gordon Linoff Jun 22 '12 at 14:09
  • [This is another good question](http://stackoverflow.com/questions/42013/levenshtein-distance-based-methods-vs-soundex) on using Levenshtein distance instead of soundex. – N West Jun 22 '12 at 14:11
  • How will you handle city names like "Pittsburgh" PA and "Pittsburg" CA - they shoudn't be renamed to one another. You should probably include zip code or state as well for safety... – N West Jun 22 '12 at 14:19
  • @ N West yes infact there is zip code and state – silverkid Jun 22 '12 at 15:54

1 Answers1

3

There is going to be some manual work involved, and building a front end for it may not be the trouble if this is a one time thing.

What I would do is the following:

  1. Generate a list of all misspellings.
  2. Generate suggestions based on the soundex
  3. Manually go over the list, manually selecting the right one, and run an update statement manually per fix.

So, how to do this:

SELECT * FROM TableA as orig
LEFT OUTER JOIN
TableB as correct
ON SOUNDEX(orig.city_name) = SOUNDEX(correct.city_name)
WHERE orig.City_name NOT IN (SELECT City_name FROM TableB)

and write the update statements by hand. You might get no suggestions through the soundex, or get multiple selections, which you are going to have to resolve yourself. Computers just aren't that smart.

Martijn
  • 11,964
  • 12
  • 50
  • 96
  • gettting an error unknown column tableA.city_name in subquery – silverkid Jun 22 '12 at 16:04
  • ok this fixed it SELECT * FROM TableA LEFT OUTER JOIN TableB ON soundex(tableA.city_name) = soundex(tableB.city_name) WHERE tableA.City_name NOT IN (SELECT City_name FROM TableB) – silverkid Jun 22 '12 at 18:00
  • Thanks, I overlooked that. I fixed it slightly different in an edit, to use the aliases. Thing tend to get clearer from aliases IMO, but then again, your tables aren't called `TableA` and `TableB` (or at least, I hope they don't!) – Martijn Jun 25 '12 at 19:16