Leveraging @RichardScriven's insight that adist
could be used (it calculates "approximate string distance". I made a function to be more comprehensive. Please note "trafos"
stands for the "transformations" used to determine the "distance" between two strings (example at bottom)
EDIT This answer can produce wrong/unexpected results; as pointed out by @wdkrnls:
I ran your function against "apple" and "big apple bagels" and it returned "appl". I would have expected "apple".
See the explanation below for the wrong result. We start with a function to get the longest_string
in a list:
longest_string <- function(s){return(s[which.max(nchar(s))])}
Then we can use @RichardSriven's work and the stringi
library:
library(stringi)
lcsbstr <- function(a,b) {
sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
longest_cmn_sbstr <- longest_string(cmn_sbstr)
return(longest_cmn_sbstr)
}
Or we can rewrite our code to avoid the use of any external libraries (still using R's native adist
function):
lcsbstr_no_lib <- function(a,b) {
matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
lengths<- attr(matches, 'match.length')
which_longest <- which.max(lengths)
index_longest <- matches[which_longest]
length_longest <- lengths[which_longest]
longest_cmn_sbstr <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
return(longest_cmn_sbstr )
}
Both above functions identify only 'hello '
as the longest common substring, instead of 'hello r'
(regardless of which argument is the longer of the two):
identical('hello',
lcsbstr_no_lib('hello', 'hello there'),
lcsbstr( 'hello', 'hello there'),
lcsbstr_no_lib('hello there', 'hello'),
lcsbstr( 'hello there', 'hello'))
LAST EDIT
Note some odd behavior with this result:
lcsbstr('hello world', 'hello')
#[1] 'hell'
I was expecting 'hello'
, but since the transformation actually moves (via deletion) the "o" in world to become the "o" in hello -- only the hell part is considered a match according to the M
:
drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1] vvvv v
#[1] "hello world"
This behavior is observed using this Levenstein tool -- it gives two possible solutions, equivalent to these two transformations
#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"
I don't know if we can configure adist
to prefer one solution over another? (the transformations have the same "weight" -- the same number of "M" and "D"'s -- don't know how to prefer the transformations with the greater number of consecutive M
)
Finally, don't forget adist allows you to pass in ignore.case = TRUE
(FALSE
is the default)
- Key to the
"trafos"
property of adist
; the "transformations" to get from one string to another:
the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements M
, I
, D
and S
indicating a match, insertion, deletion and substitution