Check if all characters of one string exist in another string in R

Question

I am trying to compare strings such as PRABHAKAR SHARMA and SHARMA KUMAR PRABHAKAR. The intention is to check whether all the characters of the shorter string exist in the other string. If they do, I should get a 100% match; otherwise, I should get the percentage of characters that matched.

I tried using levenshteinSim in the RecordLinkage package, but it returns a similarity score based on the number of edits required to change one string into the other.

install.packages("RecordLinkage")
require(RecordLinkage)
levenshteinSim("PRABHAKAR SHARMA","SHARMA KUMAR PRABHAKAR")

#[1] 0.3636364

I want a 100% match in such a case. Also, this has to be replicated for over 1,000,000 records.
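
For reference, the score above is consistent with a normalised edit distance. The following is a minimal sketch using base R's adist, on the assumption that levenshteinSim computes 1 - distance / length of the longer string. It illustrates why an order-sensitive measure scores these two strings low even though they share the same characters.

```r
s1 <- "PRABHAKAR SHARMA"
s2 <- "SHARMA KUMAR PRABHAKAR"

# Base R edit distance: insertions, deletions and substitutions each cost 1
d <- as.numeric(adist(s1, s2))

# Assumed normalisation: divide by the length of the longer string
sim <- 1 - d / max(nchar(s1), nchar(s2))
sim
```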

Do you mean all _characters_ or all _words_ from the shorter string need to be matched in the longer string? — talat, Mar 18 '16 at 13:15
In case you are looking for word matches, you could try something like `long_strings <- "SHARMA KUMAR PRABHAKAR"; short_strings <- "PRABHAKAR SHARMA"; mapply(function(l, s) mean(s %in% l), strsplit(long_strings, " "), strsplit(short_strings, " "))` — talat, Mar 18 '16 at 13:25
all characters.. it could be further jumbled up.. for example- "PRA SHARBHAK ARRMA" and "PRABHAKAR SHARMA KUMAR".. — Oshan, Mar 18 '16 at 13:31

Karsten W. · Accepted Answer · 2016-03-18T14:38:07.047

5

Here is one approach:

s1 <- "PRABHAKAR SHARMA"
s2 <- "SHARMA KUMAR PRABHAKAR"

compare <- function(s1, s2) {
    c1 <- unique(strsplit(s1, "")[[1]])
    c2 <- unique(strsplit(s2, "")[[1]])
    length(intersect(c1,c2))/length(c1)
}

compare(s1,s2)
#[1] 1

It may be a little slow, though, and it treats the space as a character too. Note that it compares the sets of distinct characters, so repeated characters are not counted. Use Vectorize to apply it to a column:

dat <- data.frame(small=c("a", "b"), big=c("aa", "cc"), stringsAsFactors=FALSE)
vcomp <- Vectorize(compare)
dat <- transform(dat, comp=vcomp(small, big))
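
If spaces should not count as characters, one possible variation strips them before splitting. This is only a sketch: compare_nospace is a hypothetical helper name, and the third row reuses the jumbled example from the question comments. It uses mapply directly, which is what Vectorize wraps.

```r
# Hypothetical helper: share of distinct non-space characters of `small` found in `big`
compare_nospace <- function(small, big) {
    c1 <- unique(strsplit(gsub(" ", "", small, fixed = TRUE), "")[[1]])
    c2 <- unique(strsplit(gsub(" ", "", big, fixed = TRUE), "")[[1]])
    if (length(c1) == 0) return(NA_real_)  # nothing to match against
    length(intersect(c1, c2)) / length(c1)
}

dat <- data.frame(small = c("a", "b", "PRA SHARBHAK ARRMA"),
                  big   = c("aa", "cc", "PRABHAKAR SHARMA KUMAR"),
                  stringsAsFactors = FALSE)

# Row-wise comparison of the two columns
dat$comp <- mapply(compare_nospace, dat$small, dat$big, USE.NAMES = FALSE)
dat$comp
```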

edited Mar 18 '16 at 14:38

answered Mar 18 '16 at 13:27

Karsten W.


You can check for case sensitivity with a boolean param, very nice :) – TheRimalaya Mar 18 '16 at 13:39
Thanks Karsten, this is perfect. Can you please suggest a way to replicate it to compare two columns of a dataframe? – Oshan Mar 18 '16 at 14:19
Edited answer to address the column case. – Karsten W. Mar 18 '16 at 14:38
Thank you @KarstenW for the code to compare two columns of a dataframe. It seems like it only matches corresponding rows. Could you please suggest how to iterate over two data frames, so that 'a' is not only compared to 'aa', but also 'cc'? Thank you! – kate88 Aug 27 '18 at 14:44
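
For the all-pairs case raised in the last comment, one option is to pass the vectorized function to outer, which evaluates every combination of the two inputs. This is a sketch rather than a tested solution for large data: the result has one cell per pair, so it grows as the product of the two lengths and would not be practical for a million records on each side.

```r
compare <- function(s1, s2) {
    c1 <- unique(strsplit(s1, "")[[1]])
    c2 <- unique(strsplit(s2, "")[[1]])
    length(intersect(c1, c2)) / length(c1)
}

small <- c("a", "b")
big <- c("aa", "cc")

# outer() expects a function that accepts whole vectors, hence Vectorize
vcomp <- Vectorize(compare, USE.NAMES = FALSE)
scores <- outer(small, big, vcomp)

# Rows are the short strings, columns are the long strings
dimnames(scores) <- list(small, big)
scores
```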

score 3 · Answer 2 · answered Mar 18 '16 at 13:48


If the characters to be considered are only letters, you could use:

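comp <- function(s1, s2) {
    # For each of a..z, flag whether it occurs in the string (case-insensitive)
    in1 <- letters %in% strsplit(tolower(s1), "")[[1]]
    in2 <- letters %in% strsplit(tolower(s2), "")[[1]]
    # Share of the distinct letters of s1 that also occur in s2
    sum(in1 & in2) / sum(in1)
}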
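
Both approaches so far compare distinct letters only. If repeated letters should count toward the percentage, which is one possible reading of "percentage of characters that matched", a count-aware variant could look like the sketch below. comp_counts is a hypothetical name, and like the function above it ignores case and anything that is not a letter a to z.

```r
# Hypothetical count-aware variant: each letter of s1 is matched at most as
# many times as it occurs in s2
comp_counts <- function(s1, s2) {
    count_letters <- function(s) {
        idx <- match(strsplit(tolower(s), "")[[1]], letters)
        tabulate(idx[!is.na(idx)], nbins = 26)  # drop non-letters, count a..z
    }
    n1 <- count_letters(s1)
    n2 <- count_letters(s2)
    sum(pmin(n1, n2)) / sum(n1)
}

comp_counts("PRABHAKAR SHARMA", "SHARMA KUMAR PRABHAKAR")
comp_counts("aab", "ab")
```

The second call differs from the set-based versions: "aab" has two a's but "ab" supplies only one, so only two of the three letters are matched.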

answered Mar 18 '16 at 13:48

tfc

