I work in a health care company and I have trouble with the hospitalization report data. I have the data are coming from various sources: Excel Reports, Plain Text File, and in some cases paper. I managed to get all the data into an Excel File. But I am running into a problem where each person spelled and referred to the same hospital.
For Example: New York Presbyterian Hospital, I have seen more than 10 variation.
- New York Presbyterian Hospital
- NY Presbyterian Hospital
- Presbyterian Hospital
- Presb Hospital
- PresbHosp
- New_York_Presb_Hosp
- NYPresbHosp
- Columbia Presbyterian Medical Center
- NYP/Columbia University Medical Center
- New York Presbyterian Hospital Columbia University Medical
- A more more cases where the hospital name is misspelled
- A few of the different system string limit and cut off the string in random places, or maybe they copy and pasted incorrectly.
- Different nurses refer to the Hospital in a differently
In my effect I am trying to create a true database that can store all the membership's information, but I am running into a wall because each staff/department are naming the hospital in a different way. (There is a provider ID unique to each hospital), but most of the reports I received only included "name". I have over 2000 members with about 100-150 hospitals, but 3 or 4 times the amount of different names.
I know Levenshtein distance could be in use, but in such extreme case, is there a strategy to build a match? There are too much data to do by hands (time consuming), since this is one of the dozens or reports I am assigned. Any suggestion would be appreciated.