Similarity between string rows in a data.frame

Question

I have a data frame like this (image: pta corpus).

Each row of pta_content holds the text of one preferential trade agreement (PTA). I'm trying to calculate the similarity between every pair of rows and obtain a similarity matrix labelled with the PTA names.

I have tried stringdist, but it seems to be meant for comparing two data frames. How can I calculate the pairwise similarities between the rows within a single data frame?

Maybe just use `dist()`. Also, it is always a good idea to share a reproducible example. The image doesn't really help here. - Rana Usman, Mar 29 '18 at 10:32
@RanaUsman `dist()` can only be applied to a numeric matrix or data frame. I have string rows. - willwang, Mar 29 '18 at 10:38

score 0 · Answer 1 · answered Mar 29 '18 at 10:48

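# Two character vectors of equal length
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")

library(RecordLinkage)
# Compares a[i] with b[i] element by element (not every pair):
# similarity = 1 - Levenshtein distance / length of the longer string
levenshteinSim(a, b)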

Result

[1] 0.7142857 0.6666667 0.9000000

Since no sample data was provided, I can't do much more than this.

This is taken from the question "Similarity scores based on string comparison in R (edit distance)".
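
The call above scores a[i] against b[i] only, whereas the question asks for every pair of rows within one column. A minimal sketch of one way to get such a matrix, using base R's `adist()` (from the utils package) instead of RecordLinkage, and normalising the same way as `levenshteinSim()`. The data frame `pta` and the column `pta_name` are hypothetical stand-ins, since the original data was not shared; only `pta_content` is named in the question.

```r
# Hypothetical stand-in for the asker's data; pta_name is an assumed column
pta <- data.frame(
  pta_name    = c("A", "B", "C"),
  pta_content = c("abcdefg", "abXdeXg", "hijklmnop"),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distances between all rows (square matrix)
d <- adist(pta$pta_content)

# Divide each distance by the length of the longer string in the pair,
# then subtract from 1 to get a similarity between 0 and 1
len <- nchar(pta$pta_content)
sim <- 1 - d / outer(len, len, pmax)

# Label rows and columns with the agreement names
dimnames(sim) <- list(pta$pta_name, pta$pta_name)
sim
```

The diagonal is 1 (each text compared with itself) and the matrix is symmetric. Two caveats: a pair of empty strings would give NaN because of the division by zero, and edit distance grows expensive on long texts such as full agreements, so this may be slow on the real data.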
