I am trying to identify near-duplicate entries of names in a database. I am new to databases, but I am familiar with R. I can already get clusters of near duplicates using fuzzy matching and Soundex in R. However, several of the names are synonyms of each other, and I would like to cluster the names on this criterion as well as the ones above.
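For context, the fuzzy-matching part can be sketched roughly as below. This is only a minimal base-R illustration, not my exact code: the sample names and the cut-off `h = 3` are made up, and the Soundex step is left out because it needs an add-on package.

```r
# Hypothetical sample names (not taken from my actual database)
names_vec <- c("Jonathan Smith", "Jonathon Smith", "Jon Smyth",
               "Maria Garcia", "Mariah Garcia")

# Pairwise generalized Levenshtein distances (utils::adist, base R)
d <- adist(names_vec, ignore.case = TRUE)

# Single-linkage hierarchical clustering, cut at an arbitrary
# edit-distance threshold (h = 3 is an assumption, tune as needed)
hc <- hclust(as.dist(d), method = "single")
clusters <- unname(cutree(hc, h = 3))

# One row per name with its cluster id
result <- data.frame(name = names_vec, cluster = clusters)
print(result)
```

Names that are only a few edits apart end up in the same cluster, which is exactly why synonyms with completely different spellings are missed.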
I want to do what is suggested in Techniques for finding near duplicate records, but with synonyms taken into account. I understand that there is a database of synonyms for English words called WordNet, in which the sets of synonyms are called synsets. However, the entries in my name field are in different formats and languages.
For example, if I know that "R version 3.0.3" and "Warm Puppy" are synonyms, I want to be able to use custom synsets such as syn1 <- c("R version 3.0.3", "Warm Puppy") when clustering near duplicates.
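To make the idea concrete, here is a rough sketch of what I have in mind, using base R only. The approach (mapping every synset member to the first member of its synset before computing distances) is just one assumption about how this could work. The helper `canonicalise`, the sample names, and the threshold `h = 2` are hypothetical.

```r
# Custom synsets: each element is a set of names treated as synonyms
synsets <- list(
  syn1 = c("R version 3.0.3", "Warm Puppy")
)

# Lookup table: lower-cased member -> canonical name (first member of its synset)
members   <- unlist(synsets, use.names = FALSE)
canonical <- rep(vapply(synsets, `[`, character(1), 1), lengths(synsets))
lookup    <- setNames(canonical, tolower(members))

# Hypothetical helper: replace known synonyms, leave other names untouched
canonicalise <- function(x) {
  hit <- unname(lookup[tolower(x)])
  ifelse(is.na(hit), x, hit)
}

# Hypothetical sample names
names_vec <- c("R version 3.0.3", "Warm Puppy", "warm puppy",
               "R version 3.0.2", "Some Other Name")
canon <- canonicalise(names_vec)

# Fuzzy clustering on the canonical names (threshold is an assumption)
d <- adist(canon, ignore.case = TRUE)
clusters <- unname(cutree(hclust(as.dist(d), method = "single"), h = 2))

print(data.frame(name = names_vec, canonical = canon, cluster = clusters))
```

In this sketch "Warm Puppy" and "warm puppy" join the cluster of "R version 3.0.3" through the synset, while "R version 3.0.2" joins it through edit distance. I do not know whether this scales or whether there is a more standard way to do it.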
Down the road, I would also like to separate homonyms within clusters based on the entries in other fields (columns) of a record.
Is there any method to implement this in R?