I am trying to write some R code to determine whether the letters in a short string are contained in a longer string. The accuracy would then be returned as a percentage.

I found the following on StackOverflow (check if all characters of one string exist in another string in r), but the code provided computes the score as the number of unique overlapping letters divided by the number of unique letters, i.e. it does not account for repeated letters.

s1 <- "ABBDEFGHIZ"
s2 <- "ABBDEFGHIJ"

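compare <- function(s1, s2) {
  # unique() drops repeated letters, so "BB" is counted only once
  c1 <- unique(strsplit(s1, "")[[1]])
  c2 <- unique(strsplit(s2, "")[[1]])
  # shared unique letters divided by unique letters in s1
  length(intersect(c1, c2)) / length(c1)
}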

compare(s1,s2)
[1] 0.8888889

Ideally, the above code should return 0.9, since 9 of the 10 letters match, rather than 8/9.

Any advice would be appreciated.

Sotos
PMH123
  • Such a function already exists. Try `RecordLinkage::levenshteinSim(s1, s2)` – Sotos Jan 28 '18 at 10:48
  • Thanks Sotos. I had used other metrics such as JW, but found flaws with them. `levenshteinSim` works in some instances; however, upon checking some of my data, I had the following issue: – PMH123 Jan 28 '18 at 12:48

1 Answer

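Something like this:

compare <- function(s1, s2) {
  c1 <- strsplit(s1, "")[[1]]
  c2 <- strsplit(s2, "")[[1]]
  # number of characters of s1 (repeats included) that occur anywhere in s2
  x <- sum(c1 %in% c2)
  # divide by the number of distinct characters across both strings
  x / length(unique(c(c1, c2)))
}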
Antonios
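
A note on the answer above: `c1 %in% c2` counts every character of s1 that occurs anywhere in s2, so a letter repeated in s1 is still counted in full even when s2 contains it only once, and the denominator is the number of distinct letters across both strings rather than the length of s1. It returns 0.9 for the example in the question, but it can differ from the "matched letters out of all letters" figure the question asks for. A minimal sketch of one possible alternative follows. It assumes that each letter of s1 may be matched against at most one occurrence of the same letter in s2, ignoring position, and that the score is the matched count divided by the length of s1. The name `compare_multiset` is made up for illustration.

```r
# Hypothetical helper (not from the original thread): share of the letters in
# s1 that can be paired one-to-one with the same letter in s2, ignoring order.
compare_multiset <- function(s1, s2) {
  c1 <- strsplit(s1, "")[[1]]
  c2 <- strsplit(s2, "")[[1]]
  letters1 <- unique(c1)
  # how often each distinct letter of s1 occurs in s1 and in s2
  n1 <- vapply(letters1, function(ch) sum(c1 == ch), numeric(1))
  n2 <- vapply(letters1, function(ch) sum(c2 == ch), numeric(1))
  # a letter can be matched at most as often as it occurs in both strings
  sum(pmin(n1, n2)) / length(c1)
}

compare_multiset("ABBDEFGHIZ", "ABBDEFGHIJ")
# [1] 0.9
```

Under this assumption, `compare_multiset("AAB", "AB")` gives 2/3, because only one of the two "A"s in the first string has a counterpart in the second. If the position of the letters also matters, an edit-distance measure such as the `levenshteinSim` function mentioned in the comments may be a better fit.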