I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. One idea I have considered is concatenating all the values in each row into a single string and then applying a partial (fuzzy) matching function to those strings. Ideally the level of similarity required to qualify as a match would be adjustable (for example, return all rows that match at least 75% of the characters in another row).
Here is a simple working example.
df <- data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))
In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (it is too dissimilar). Thanks for any suggestions.
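To make the string-merging idea concrete, here is a rough sketch using only base R. It assumes that "similarity" means 1 minus the Levenshtein edit distance (from `utils::adist`) divided by the length of the longer string, which is only one possible reading of "75% of the characters"; the names `keys`, `sim`, `threshold`, and `matches` are placeholders of my own.

```r
df <- data.frame(name  = c("Andrew", "Andrem", "Adam", "Pamdrew"),
                 id    = c(12334, 12344, 34345, 98974),
                 score = c(90, 90, 83, 95))

# Collapse every row into one string, e.g. "Andrew 12334 90".
# Numeric columns are compared as text here (an assumption of this sketch).
keys <- do.call(paste, c(unname(as.list(df)), sep = " "))

# Pairwise Levenshtein edit distances between the row strings.
d <- adist(keys)

# Turn distances into similarities in [0, 1] by dividing by the
# length of the longer string in each pair (assumed definition).
len <- nchar(keys)
sim <- 1 - d / outer(len, len, pmax)

# Adjustable cut-off: keep each pair once (upper triangle only).
threshold <- 0.75
pairs <- which(sim >= threshold & upper.tri(sim), arr.ind = TRUE)

matches <- data.frame(row_a      = unname(pairs[, "row"]),
                      row_b      = unname(pairs[, "col"]),
                      similarity = sim[pairs])
matches
```

With this definition and a threshold of 0.75, rows 1 and 2 are flagged as a pair, while row 4 falls below the cut-off against every other row. Raising or lowering `threshold` makes the matching stricter or looser. Because `adist` builds a full pairwise matrix, this is probably only practical for data frames of modest size.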