1

I work in a health care company and I have trouble with the hospitalization report data. I have the data are coming from various sources: Excel Reports, Plain Text File, and in some cases paper. I managed to get all the data into an Excel File. But I am running into a problem where each person spelled and referred to the same hospital.

For Example: New York Presbyterian Hospital, I have seen more than 10 variation.

  1. New York Presbyterian Hospital
  2. NY Presbyterian Hospital
  3. Presbyterian Hospital
  4. Presb Hospital
  5. PresbHosp
  6. New_York_Presb_Hosp
  7. NYPresbHosp
  8. Columbia Presbyterian Medical Center
  9. NYP/Columbia University Medical Center
  10. New York Presbyterian Hospital Columbia University Medical
  11. A more more cases where the hospital name is misspelled
  12. A few of the different system string limit and cut off the string in random places, or maybe they copy and pasted incorrectly.
  13. Different nurses refer to the Hospital in a differently

In my effect I am trying to create a true database that can store all the membership's information, but I am running into a wall because each staff/department are naming the hospital in a different way. (There is a provider ID unique to each hospital), but most of the reports I received only included "name". I have over 2000 members with about 100-150 hospitals, but 3 or 4 times the amount of different names.

I know Levenshtein distance could be in use, but in such extreme case, is there a strategy to build a match? There are too much data to do by hands (time consuming), since this is one of the dozens or reports I am assigned. Any suggestion would be appreciated.

George
  • 4,514
  • 17
  • 54
  • 81
  • 3
    1-4 should be pretty easy to match. IMHO, the best idea would be a system checking for, say the 4 or 5 most frequent spellings, and for the rest (15%?), draw attention of a human with a given selection of closest correct names, with a friendly UI, for example checkboxes so he just needs to check one. – kebs Jan 16 '16 at 18:28
  • 2
    You could try to unify the spellings by converting all characters to lowercase, delete words meaningless for distinction (Medical, Hospital), and sort the words alphabetically. The SOUNDEX algorithm or phonetic matching in general might be helpful. – Axel Kemper Jan 16 '16 at 18:40
  • 1
    The Levenshtein distance will help you only with misspelled names. Another heuristic that might be helpful is the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index): the size of the intersection divieded by the size of te union. That would give you 1.0 for #1 and 0.5 for #3. Since the name must refer to a hospital, it might also be a good idea to remove common words that aren't useful in distinguishing the hospitals before matching, words like "hospital", "medical" and so on. (And if your hospitals are all in New York, "NY" and "New York", too.) – M Oehm Jan 16 '16 at 18:45
  • 2
    "100-150 hospitals, but 3 or 4 times the amount of different names." 3-400 names? You're going to spend longer writing and testing and checking a fuzzy matching system which is any good, than you are just doing them by hand. Especially if you can offload the hand work to someone else. You only need to match them up once, ever, and then save that, right? – TessellatingHeckler Jan 16 '16 at 20:33
  • @TessellatingHeckler There are on going updates every month, between 100-200 members will move in and out of hospital. I thought about the idea from kebs, and if I can keep these relationship... maybe I can reuse them on a monthly basis. – George Jan 16 '16 at 21:15
  • 1
    Creating an automated system that can determine the hospital name is a big "R" research problem. When I had to do something like this, I opted for the low-tech solution that took me less than a day to implement. See http://stackoverflow.com/a/20354639/56778. If you then saved the replacements file and added new entries each month as you encounter them, you'd have a pretty comprehensive list in a fairly short amount of time. – Jim Mischel Aug 30 '16 at 15:41

2 Answers2

2

This is a pretty standard and pretty difficult problem. Entire companies exist to solve it for big data.

The usual strategy is to encode what is known about the data domain in a heuristic algorithm to classify the data before putting it in the database.

A standard classification method would be to create a set of pattern strings for each hospital. The examples you gave might go in the pattern set initially.

Then for each incoming string and each pattern, calculate a metric that's the difference between the string and pattern. Levenshtein is a good starting point. The set containing the least distance pattern (in this case Columbia Presbyterian) wins. An excessive least distance means your pattern set is no good. (You get to tweak what "excessive" means.) More than one low distance (you get to define "low," too) means the pattern set has inadvertent overlaps.

Both problems may be handled in various ways, usually involving human intervention either to classify the data or enhance the pattern sets or both.

A second possibility is to use regexes as patterns. Then a match is equivalent to distance zero above, and a non-match is distance infinity. As you might expect, this makes the algorithm less flexible. Yet for some kinds of data - probably not yours though - it's the best choice.

Gene
  • 46,253
  • 4
  • 58
  • 96
0

You should look for "specific patterns" which your data is forming. What i have observed is, out of the strings that you've revealed-- "Presb" is the sub-string which is used in all strings (variations of hospital fields that you have been provided with). @M-ohem's comment is a nice approach as well. But for the starters, you can put up a regular expression which checks if any input string has the pattern "Persb" in it. Learn More