R + tm: How do I de-duplicate items in a list, based on semantic similarity?
v<-c("bank","banks","banking", "ford_suv',"toyota_suv","nissan_suv")
. My expected solution would be c("bank", "ford_suv',"toyota_suv","nissan_suv")
. That is, bank, banks and banking to be reduced to one term "bank." SnowBall::stemming
is not an option because I have to retain the flavor of newspaper styles of various countries. Any help or direction will be useful.

- 350
- 3
- 16
-
1If you don't despise `for loops`, maybe [this](https://www.dropbox.com/s/yepnihx03ar2dkx/relatedness.txt?dl=0) can be helpful. – Abdou Aug 15 '16 at 15:16
1 Answers
We could calculate the Levenshtein distance between words using adist
and regroup them into clusters using hclust
d <- adist(v)
rownames(d) <- v
Which gives a matrix of distance between terms:
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#bank 0 1 3 8 9 8 2 13 6 5 3 4
#banks 1 0 3 7 9 7 2 13 6 6 2 5
#banking 3 3 0 8 10 8 3 13 7 6 3 7
#ford_suv 8 7 8 0 5 6 8 12 7 7 8 4
#toyota_suv 9 9 10 5 0 6 9 7 4 9 9 9
#nissan_suv 8 7 8 6 6 0 8 13 10 4 8 10
#banker 2 2 3 8 9 8 0 12 6 6 1 6
#toyota_corolla 13 13 13 12 7 13 12 0 8 13 12 12
#toyota 6 6 7 7 4 10 6 8 0 6 7 5
#nissan 5 6 6 7 9 4 6 13 6 0 7 6
#bankers 3 2 3 8 9 8 1 12 7 7 0 6
#ford 4 5 7 4 9 10 6 12 5 6 6 0
Then we can pass it to hclust
using method = ward.D
cl <- hclust(as.dist(d), method = "ward.D")
plot(cl)
Which gives:
We notice 4 distinct clusters (that we can illustrate using rect.hclust(cl, 4)
)
Now, we can turn this result into a data.frame
and tag each cluster with it's shortest term:
library(dplyr)
data.frame(group = cutree(cl, 4)) %>%
tibble::rownames_to_column("term") %>%
group_by(group) %>%
mutate(tag = term[nchar(term) == min(nchar(term))])
Which gives:
#Source: local data frame [12 x 3]
#Groups: group [4]
#
# term group tag
# <chr> <int> <chr>
#1 bank 1 bank
#2 banks 1 bank
#3 banking 1 bank
#4 ford_suv 2 ford
#5 toyota_suv 3 toyota
#6 nissan_suv 4 nissan
#7 banker 1 bank
#8 toyota_corolla 3 toyota
#9 toyota 3 toyota
#10 nissan 4 nissan
#11 bankers 1 bank
#12 ford 2 ford
Should we want to extract only the unique tag
for each cluster, we could add ... %>% distinct(tag) %>% .$tag
to the pipe which would give:
#[1] "bank" "ford" "toyota" "nissan"
Reference
?adist
The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t (so that the transformation exactly matches t).
?hclust
This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.
Note: I used data provided by @Abdou in the comments as it represents a more complete use case

- 21,343
- 7
- 57
- 77
-
1@Steven +100 internets to you. Fifty for a complete explanation of the underlying logic and 50 for the cross references. – Pradeep Aug 16 '16 at 05:36
-
@Steven: These are frequent terms from `findFreqTerms(dtm[mycluster1[1],],20)` where `mycluster` is `groups <- hclust(dist(m),method="ward.D") mycluster<-rect.hclust(groups,10) #ten clusters`, is there a way to pipe a `hclust` within the higher `hclust`? Don't want to open another question. Thanks. – Pradeep Aug 16 '16 at 05:42
-
@Pradeep I'd be happy to help but I'm not sure I follow. Perhaps you could open a new question with a [MCVE] ? – Steven Beaupré Aug 16 '16 at 09:22
-
@Steven: I am working on getting this MWE across to you. annonymization of data is a problem, though. – Pradeep Aug 25 '16 at 13:59