Identifying near duplicate entries using synonyms in R

Question

I am trying to identify near-duplicate name entries in a database. I am new to databases, but I am familiar with R. I can get clusters of near duplicates using fuzzy matching and Soundex in R. However, several names are synonyms of each other, and I would like to cluster the names on this criterion as well as the ones above.

I want to do what is suggested in "Techniques for finding near duplicate records", but with synonyms. I understand there is a database of synonyms for English words called WordNet, which groups synonyms into sets called synsets. But the entries in my names field are in different formats and languages.

For example, if I know that "R version 3.0.3" and "Warm Puppy" are synonyms, I want to be able to use a custom synset such as syn1 <- c("R version 3.0.3", "Warm Puppy") when clustering near duplicates.

Down the road, I would also like to separate homonyms within clusters based on entries in other fields (columns) of a record.

Is there any method to implement this in R?

score 1 · Answer 1 · answered Mar 14 '14 at 13:40

Crops, this is not an answer, but it might help you or others who answer.

As I assume you know, the tm package allows custom stop words, but I can't recall it supporting a custom vector of synonyms, as in your Warm Puppy example. That would be very useful.

Second, Tyler Rinker's qdap package has lots of capabilities and might have (or he might create) such a synonym capability.

Third, the RTextTools package amalgamates many packages and functions. The team behind it may help.

It would be very useful to have a synonym-vector capability for what I do. Good luck and I will check back.

lawyeR

Yes @user2583119, this is not an answer, but I think you have pushed the discussion in the right direction. The qdap package has a synonym-searching function, syn, which uses a built-in dictionary, "SYNONYMN". If a custom dictionary supplied as a data frame can be used instead, then we can try to get the desired synonym clustering. – Crops Mar 15 '14 at 04:51
There is now a user-defined synonym lookup option in qdap, thanks to Tyler Rinker. I will try to use that. – Crops Mar 18 '14 at 05:40
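
For anyone landing here, a minimal base-R sketch of the custom-synset idea from the question. It does not use qdap's API. It takes edit distances between names, forces the distance to 0 for pairs that share a user-defined synset, and then runs single-linkage clustering so that typos of a synonym end up in the same cluster. The example entries, the helper names (normalise, synset_of) and the cut-off h = 1 are assumptions made up for illustration.

```r
# Hypothetical custom synsets: each element lists names known to be synonyms.
synsets <- list(syn1 = c("R version 3.0.3", "Warm Puppy"))

# Made-up entries, including a case variant and a typo.
entries <- c("R version 3.0.3", "warm puppy", "Warm Pupy", "Another name")

# Light normalisation so that case and stray whitespace do not matter.
normalise <- function(x) tolower(trimws(x))

# Return the synset id of each name, or NA when it is in no synset.
synset_of <- function(x, synsets) {
  members <- unlist(lapply(synsets, normalise), use.names = FALSE)
  ids     <- rep(names(synsets), lengths(synsets))
  ids[match(normalise(x), members)]
}

ids <- synset_of(entries, synsets)

# Fuzzy part: pairwise edit distances between the normalised names.
d <- adist(normalise(entries))

# Synonym part: names sharing a synset get distance 0.
same_synset <- outer(ids, ids, "==")   # NA wherever a name has no synset
d[which(same_synset)] <- 0             # which() skips the NA cells

# Single linkage lets a typo join a cluster through its nearest member.
tree     <- hclust(as.dist(d), method = "single")
clusters <- cutree(tree, h = 1)        # assumed threshold: at most one edit

data.frame(entry = entries, synset = ids, cluster = clusters)
```

Here "Warm Pupy" is not in the synset itself, but it is one edit away from "warm puppy", so single linkage pulls it into the same cluster as "R version 3.0.3". Soundex codes or a different string distance could be swapped in for adist in the same way.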
