I have an existing list of manually mapped geographic regions, as follows:
Czech Republic-Construction Emerging CEEMEA CEEMEA
Czech Republic-Residential Emerging CEEMEA CEEMEA
Czech-Slovakia Emerging CEEMEA CEEMEA
Daiichi Sankyo US Developed North America North America
Dailian Emerging China Asia
Daimaru Developed Japan Japan
Dairy products Other Other Other
Dalian Emerging China Asia
As you can see, I map each region to its proper geographic location, and companies, if any, to 'Other'. The new regions that I encounter often contain spelling mistakes, so I use a set of algorithms to check whether a new string is close enough to one that is already mapped; if so, I copy the existing mapping to the new region.
This is how I currently combine the algorithms:
    // Levenshtein distance
    if (LevenshteinDistance == 1)
        Match string to existing entry.
    else if (LevenshteinDistance == 2)
        if (Jaro-Winkler > 0.85)
            Match string to existing entry.
    else if (LevenshteinDistance == 3)
        if (WildCardMatching)
            if (Jaro-Winkler)
                Match string to existing entry.
            else
                Add string to list for manual mapping.
    else
        Add string to list for manual mapping.
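To make the flow above concrete, here is a minimal Python sketch of the same cascade. It is an illustration, not my production code. Several details are assumptions because the flow does not state them: the names `decide`, `jaro_winkler` and `wildcard_match` are hypothetical, the two similarity checks are passed in as functions rather than implemented, a distance of 0 is treated as a match, a failed check at distance 2 or 3 falls through to manual mapping, and the distance-3 branch reuses the 0.85 Jaro-Winkler threshold.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def decide(new: str,
           known: str,
           jaro_winkler: Callable[[str, str], float],
           wildcard_match: Callable[[str, str], bool],
           jw_threshold: float = 0.85) -> str:
    """Return "match" or "manual" for one (new, known) pair.

    jaro_winkler and wildcard_match are hypothetical callables standing in
    for the two linked algorithms; they are not implemented here.
    """
    d = levenshtein(new, known)
    if d <= 1:
        # d == 0 (exact match) is an assumption; the flow only names d == 1.
        return "match"
    if d == 2:
        # Assumption: a failed Jaro-Winkler check means manual mapping.
        return "match" if jaro_winkler(new, known) > jw_threshold else "manual"
    if d == 3:
        # Assumption: same threshold as the distance-2 branch.
        if wildcard_match(new, known) and jaro_winkler(new, known) > jw_threshold:
            return "match"
        return "manual"
    return "manual"
```

With this sketch, `levenshtein("Dailian", "Dalian")` is 1, so that pair is matched immediately, while `levenshtein("Labor", "Gabon")` is 2, so that pair is decided entirely by the Jaro-Winkler check.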
Wildcard matching algorithm: http://www.geeksforgeeks.org/wildcard-character-matching/
Jaro-Winkler algorithm: Jaro–Winkler distance algorithm in C#
My question: even after all this, I still find entries that are mapped incorrectly, e.g. Labor and Gabon. Is there a way to add more algorithms, or to change the way I am currently using these ones, to get a better matching flow?
Thank you for any help.