
I am trying to compare strings like PRABHAKAR SHARMA and SHARMA KUMAR PRABHAKAR. The intention is to check whether all the characters of the shorter string exist in the other string. If they do, I should get a 100% match; otherwise, I should get the percentage of characters that matched.

I tried levenshteinSim from the RecordLinkage package, but it returns a similarity score based on the number of edits needed to change one string into the other:

install.packages("RecordLinkage")
require(RecordLinkage)
levenshteinSim("PRABHAKAR SHARMA","SHARMA KUMAR PRABHAKAR")

#[1] 0.3636364

I want a 100% match in such a case. Also, this has to be replicated for over 1,000,000 records.

Ronak Shah
Oshan
  • Do you mean all _characters_ or all _words_ from the shorter string need to be matched in the longer string? – talat Mar 18 '16 at 13:15
  • In case you are looking for word matches, you could try something like `long_strings <- "SHARMA KUMAR PRABHAKAR"; short_strings <- "PRABHAKAR SHARMA"; mapply(function(l, s) mean(s %in% l), strsplit(long_strings, " "), strsplit(short_strings, " "))` – talat Mar 18 '16 at 13:25
  • All characters. The strings could be jumbled up further, for example "PRA SHARBHAK ARRMA" and "PRABHAKAR SHARMA KUMAR". – Oshan Mar 18 '16 at 13:31

2 Answers


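Here is one approach: split both strings into their unique characters and compute the share of the first string's characters that also occur in the second.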

s1 <- "PRABHAKAR SHARMA"
s2 <- "SHARMA KUMAR PRABHAKAR"

compare <- function(s1, s2) {
    c1 <- unique(strsplit(s1, "")[[1]])
    c2 <- unique(strsplit(s2, "")[[1]])
    length(intersect(c1,c2))/length(c1)
}

compare(s1,s2)
#[1] 1

This may be slow on large inputs, and note that it also counts the space as a character. To apply it row by row to two columns of a data frame, wrap it with Vectorize:

dat <- data.frame(small=c("a", "b"), big=c("aa", "cc"), stringsAsFactors=FALSE)
vcomp <- Vectorize(compare)
dat <- transform(dat, comp=vcomp(small, big))
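If the space should not count as a character, one option is to strip it before splitting. The following is only a sketch: `compare_nospace` is a hypothetical name, and returning `NA` for a first string without any non-space characters is an assumption, not something the question specifies. It uses `mapply` directly instead of `Vectorize`, which does the same row-by-row pairing.

```r
# Hypothetical variant of compare() that ignores spaces.
compare_nospace <- function(s1, s2) {
    # Drop spaces, then reduce each string to its set of unique characters
    c1 <- unique(strsplit(gsub(" ", "", s1, fixed = TRUE), "")[[1]])
    c2 <- unique(strsplit(gsub(" ", "", s2, fixed = TRUE), "")[[1]])
    # Assumption: nothing to compare means "no score" rather than 0 or 1
    if (length(c1) == 0) return(NA_real_)
    length(intersect(c1, c2)) / length(c1)
}

dat <- data.frame(small = c("a b", "b"), big = c("aa", "cc"),
                  stringsAsFactors = FALSE)
# Compare row i of 'small' with row i of 'big'
dat$comp <- mapply(compare_nospace, dat$small, dat$big, USE.NAMES = FALSE)
dat$comp
#[1] 0.5 0.0
```

For the first row, "a b" reduces to the characters a and b, of which only a occurs in "aa", hence 0.5.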
Karsten W.
  • You can check for case sensitivity with a boolean param, very nice :) – TheRimalaya Mar 18 '16 at 13:39
  • Thanks Karsten, this is perfect. Can you please suggest a way to apply it to two columns of a data frame? – Oshan Mar 18 '16 at 14:19
  • Edited answer to address the column case. – Karsten W. Mar 18 '16 at 14:38
  • Thank you @KarstenW for the code to compare two columns of a dataframe. It seems like it only matches corresponding rows. Could you please suggest how to iterate over two data frames, so that 'a' is not only compared to 'aa', but also 'cc'? Thank you! – kate88 Aug 27 '18 at 14:44
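For the all-pairs case raised in the last comment, one possible sketch is to pass the vectorized function to `outer`, which evaluates it on every combination of the two vectors. The `compare` function is repeated from the answer so the block runs on its own; the vectors `small` and `big` are illustrative. Note that the result has `length(small) * length(big)` entries, so this is only practical for modestly sized inputs.

```r
# Same function as in the answer above
compare <- function(s1, s2) {
    c1 <- unique(strsplit(s1, "")[[1]])
    c2 <- unique(strsplit(s2, "")[[1]])
    length(intersect(c1, c2)) / length(c1)
}

small <- c("a", "b")
big <- c("aa", "cc")

# outer() calls the function once with all pairs expanded into two
# equal-length vectors, so the function must be vectorized
m <- outer(small, big, Vectorize(compare))
dimnames(m) <- list(small, big)
m
#   aa cc
# a  1  0
# b  0  0
```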

If the characters to be considered are only letters you could use:

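comp <- function(s1, s2) {
    # For each of the 26 letters: does it occur in the (lower-cased) string?
    in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
    in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
    # Share of the letters in s1 that also occur in s2
    sum(in1 & in2) / sum(in1)
}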
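Both answers treat each string as a set of characters, so repeated letters are counted once. If repeats should matter (an assumption; the question does not say so explicitly), a count-aware variant could look like the sketch below. `comp_counts` and `letter_counts` are hypothetical names, and returning `NA` when the first string contains no letters is likewise an assumption.

```r
# Hypothetical helper: how often each of the 26 letters occurs in s
letter_counts <- function(s) {
    idx <- match(strsplit(tolower(s), "")[[1]], letters)
    idx <- idx[!is.na(idx)]          # drop spaces and other non-letters
    tabulate(idx, nbins = 26)
}

# Share of the letters in s1, counting repeats, that can be matched in s2
comp_counts <- function(s1, s2) {
    n1 <- letter_counts(s1)
    n2 <- letter_counts(s2)
    if (sum(n1) == 0) return(NA_real_)
    # Each letter can be matched at most as often as it occurs in s2
    sum(pmin(n1, n2)) / sum(n1)
}

comp_counts("PRABHAKAR SHARMA", "SHARMA KUMAR PRABHAKAR")
#[1] 1
comp_counts("aab", "ab")
#[1] 0.6666667
```

In the second call only one of the two a's in "aab" can be matched, so two of three letters match.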
tfc