longest common substring in R finding non-contiguous matches between the two strings

Question

I have a question regarding finding the longest common substring in R. While searching through a few posts on StackOverflow, I got to know about the qualV package. However, I see that the LCS function in this package actually finds all characters from string1 which are present in string2, even if they are not contiguous.

To explain, if the strings are string1 : "hello" string2 : "hel12345lo" I expect the output to be hel, however I get the output as hello. I must be doing something wrong. Please see my code below.

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

I have also tried the Rlibstree method but I still get substrings which are not contiguous. Also, the length of the substring is also off from my expectation.s Please see below.

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21

Related question: http://stackoverflow.com/q/16196327/602276 — Andrie, Feb 01 '15 at 10:52
@Andrie, I tried the Rlibstree method from the link. However, I still get substrings which are not contiguous. Also the length of the matching substring is off. Have added info as an EDIT my original post above. Please have a look. — IAMTubby, Feb 01 '15 at 12:05
To clarify: qualV’s `LCS` function doesn’t find the longest common substring, it finds the longest common *subsequence* – hence the result you are getting. That’s the definition of a subsequence. These problems are related but have quite different solutions, and the longest common *subsequence* problem is a more classical problem in computer science, and hence is the one more often implemented. — Konrad Rudolph, Feb 01 '15 at 14:17

Rich Scriven · Answer 1 · 2015-02-01T16:03:45.400

14

Here are three possible solutions.

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

There are also adist() and agrep() in base R, and the stringdist package has a few functions that run the LCS method. Here's a look at stringsidt. It returns the number of unpaired characters.

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0

Now that I've explored this a bit more, I think adist() might be the way to go. If we set counts=TRUE we get a sequence of Matches, Insertions, etc. So if you give that to stri_locate() we can use that matrix to get the matches from a to b.

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

So the M values denote straight across matches. We can go and get the substrings with stri_sub()

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"

Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.

edited Feb 01 '15 at 16:03

answered Feb 01 '15 at 13:55

Rich Scriven

97,041
11
181
245

1

While this works for short strings, it’s quite inefficient (I don’t even know the asymptotic performance … O(n^3) maybe?), and there are much more efficient solutions to this problem. – Konrad Rudolph Feb 01 '15 at 14:20
Well I'm not sure about the performance. I received a comment from OP on one of my other answers asking for help here so I figured I'd try to help. – Rich Scriven Feb 01 '15 at 14:40
@KonradRudolph -I played around with `adist()`. It seems like that's probably the way to go here – Rich Scriven Feb 01 '15 at 15:58
1

For reference, `identical(stri_sub(a, 1, 1:nchar(a)), substring(a,1,1:nchar(a)))` – Nate Anderson Oct 28 '15 at 06:05
@KonradRudolph Could you point me in the direction of a more efficient method, I have a similar problem but a huge data set to run it on. – Vaibhav Aug 29 '18 at 14:12
1

@Vaibhav An efficient solution is described at https://en.wikipedia.org/wiki/Longest_common_substring_problem — Unfortunately I don’t think an implementation for R exists. – Konrad Rudolph Aug 29 '18 at 14:37
`stringdist(a, b, method="lcs")` implements the longest common sequence as far as I know – Julien May 31 '23 at 11:52

Nate Anderson · Answer 2 · 2021-05-30T19:20:07.560

Leveraging @RichardScriven's insight that adist could be used (it calculates "approximate string distance". I made a function to be more comprehensive. Please note "trafos" stands for the "transformations" used to determine the "distance" between two strings (example at bottom)

EDIT This answer can produce wrong/unexpected results; as pointed out by @wdkrnls:

I ran your function against "apple" and "big apple bagels" and it returned "appl". I would have expected "apple".

See the explanation below for the wrong result. We start with a function to get the longest_string in a list:

longest_string <- function(s){return(s[which.max(nchar(s))])}

Then we can use @RichardSriven's work and the stringi library:

library(stringi)
lcsbstr <- function(a,b) { 
  sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
  cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
   return(longest_cmn_sbstr) 
}

Or we can rewrite our code to avoid the use of any external libraries (still using R's native adist function):

lcsbstr_no_lib <- function(a,b) { 
    matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
    lengths<- attr(matches, 'match.length')
    which_longest <- which.max(lengths)
    index_longest <- matches[which_longest]
    length_longest <- lengths[which_longest]
    longest_cmn_sbstr  <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
    return(longest_cmn_sbstr ) 
}

Both above functions identify only 'hello ' as the longest common substring, instead of 'hello r' (regardless of which argument is the longer of the two):

identical('hello',
    lcsbstr_no_lib('hello', 'hello there'), 
    lcsbstr(       'hello', 'hello there'),
    lcsbstr_no_lib('hello there', 'hello'), 
    lcsbstr(       'hello there', 'hello'))

LAST EDIT Note some odd behavior with this result:

lcsbstr('hello world', 'hello')
#[1] 'hell'

I was expecting 'hello', but since the transformation actually moves (via deletion) the "o" in world to become the "o" in hello -- only the hell part is considered a match according to the M:

drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1]  vvvv   v
#[1] "hello world"

This behavior is observed using this Levenstein tool -- it gives two possible solutions, equivalent to these two transformations

#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"

I don't know if we can configure adist to prefer one solution over another? (the transformations have the same "weight" -- the same number of "M" and "D"'s -- don't know how to prefer the transformations with the greater number of consecutive M)

Finally, don't forget adist allows you to pass in ignore.case = TRUE (FALSE is the default)

Key to the "trafos" property of adist; the "transformations" to get from one string to another:

the transformation sequences are returned as the "trafos" attribute of the return value, as character strings with elements M, I, D and S indicating a match, insertion, deletion and substitution

To add to your solution, if you know from which string - a or b you want to pick the LCS, you can add grep inside your function with 'longest_cmn_sbstr' as argument to return the full string. — Vaibhav, Aug 30 '18 at 07:33
I ran your function against "apple" and "big apple bagels" and it returned "appl". I would have expected "apple". — wdkrnls, May 28 '21 at 21:15
Yes @wdkrnls , I agree my solution is not correct for "longest" -- it relies on Levenstein, which may identify a different solution that involves "DELETIONS" (see edit of my answer) This is the reason you get "appl"; it's the same reason I get this result: `lcsbstr('hello world', 'hello')` `#[1] 'hell'` Maybe I can modify my regex so I don't look only for consecutive "M", but also check for "M" (matches) spanning "D" (deletions) — Nate Anderson, May 30 '21 at 19:09

score 2 · Answer 3 · edited Jun 01 '23 at 14:11

The LCSn function (PTXQC package) can find the longest common string for all strings in the input vector. A warning is that the shortest string is used as the base, so it might not give you what you want when comparing multiple strings. However, it is a good option to compare two sequences.

library(PTXQC)
LCSn(c("hello","hello world"))
#[1] "hello"

LCSn(c("hello", "hel123l5678o"))
#[1] "hel"

score 1 · Answer 4 · answered Feb 01 '15 at 12:22

I'm not sure what you did to get your output of "hello". Based on trial-and-error experiments below, it appears that the LCS function will (a) not regard a string as an LCS if a character follows what would otherwise be an LCS; (b) find multiple, equally-long LCS's (unlike sub() that finds just the first); (c) the order of the elements in the strings doesn't matter -- which has no illustration below; and (b) the order of the string in the LCS call doesn't matter -- also not shown.

So, your "hello" of a had no LCS in b since the "hel" of b was followed by a character. Well, that's my current hypothesis.

Point A above:

a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...

a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it

a= c("hello", "hel", "abcd") 
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character

Point B above:

a= c("hello", "hel", "abcd") 
b= c("hello", "abcd") 
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all

I'm sorry lawyeR, I have not been able to completely understand. I'm looking for a function which takes two strings as arguments and returns the substring of maximum length that is common between the two. I'm a bit confused reading the above post. — IAMTubby, Feb 01 '15 at 12:30
lawyeR, Ohh okay! But just to clarify, is there a better method to find the longest common substring between the two? — IAMTubby, Feb 01 '15 at 12:48

score 1 · Answer 5 · answered Jun 06 '18 at 09:30

1

df <- data.frame(A. = c("Australia", "Network"),
                 B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)

 auxFun <- function(x) {

   a <- strsplit(x[[1]], "")[[1]]
   b  <- strsplit(x[[2]], "")[[1]]
   lastchar <- suppressWarnings(which(!(a == b)))[1] - 1

   if(lastchar > 0){
     out <- paste0(a[1:lastchar], collapse = "")
   } else {
     out <- ""
   }

   return(out)
 }

 df$C. <- apply(df, 1, auxFun)

 df
 A.         B.    C.
 1 Australia    Austria Austr
 2   Network Netconnect   Net

answered Jun 06 '18 at 09:30

Juan Antonio Roldán Díaz

2,017
8
22

1

This will work in cases where the substring starts from the beginning of both the strings, however if the substring occurs in between some string, this will fail. – Vaibhav Aug 30 '18 at 07:24
Yes, you're right. But if you consider that the substring occurs in between some string, you can get multiple outputs for each pair. And, it is possible to adapt the code to obtain the first string that matches between some string. – Juan Antonio Roldán Díaz Aug 30 '18 at 08:07

kana · Answer 6 · 2021-05-31T02:00:52.387

Using biostrings:

library(Biostrings)
a= "hello"
b="hel123l5678o"
astr= BString(a)
bstr=BString(b)

pmatchPattern(astr, bstr)

returns:

  Views on a 12-letter BString subject
Subject: hel123l5678o
views:
      start end width
  [1]     1   3     3 [hel]
  Views on a 5-letter BString pattern
Pattern: hello
views:
      start end width
  [1]     1   3     3 [hel]

So I did a benchmark and while my answer does do the thing and actually gives you a lot more info, it is ~500x slower than @Rich Scriven lol.

system.time({
a= "hello"
b="123hell5678o"
rounds=100
for (i in 1:rounds) {
astr= BString(a)
bstr=BString(b)
pmatchPattern(astr, bstr)
}
})

system.time({
  c= "hello"
  d="123hell5678o"
  rounds=100
  for (i in 1:rounds) {
ta <- drop(attr(adist(c, d, counts=TRUE), "trafos"))
stri_sub(d, stri_locate_all_regex(ta, "M+")[[1]])
}
})

   user  system elapsed 
  2.476   0.027   2.510 

   user  system elapsed 
  0.006   0.000   0.005

Dorian Grv · Answer 7 · 2022-11-24T16:10:58.467

I adapted the answer of @Rich Scriven to my purpose. The goal is to find in a vector the longest commong string instead of the one between 2 strings. At the end it is possible use it then in data.table by group.

Example:

library(data.table)
library(stringi)

# create the function ------------------------------------

get.lcs.vector <- function(your.vector) {
  
  # get longest common string
  get.lcs <- function(x, y) {
    # get longest common string
    sb <- stri_sub(y, 1, 1:nchar(y))
    sstr <- na.omit(stri_extract_all_coll(x, sb, simplify=TRUE))
    result <- sstr[which.max(nchar(sstr))]
    return(result)
  }
  combi <- data.table(expand.grid(your.vector, your.vector, stringsAsFactors = F))[Var1 != Var2]
  combi.result <- unique(mapply(get.lcs, combi[[1]], combi[[2]]))
  lcs <- combi.result[which.min(nchar(combi.result))]
  return(lcs)
}

# example of data ------------------------------------

dt <- data.table(AN = c("APILCASERNB", "APILCASELNB", "APILCASEYHANB", 
                        "A15DPGY", "A15DPRD", "A15DPWH", "A15DPDB", "A15DPYW", "A15DPTL", 
                        "A15DP4PGY", "A15DP4PRD", "A15DP4PWH", "A15DP4PDB", "A15DP4PYW", 
                        "A15DP4PTL"),
                 Name = c("Example1", "Example1", "Example1", "Example2", 
                          "Example2", "Example2", "Example2", "Example2", "Example2", "Example3", 
                          "Example3", "Example3", "Example3", "Example3", "Example3"))

dt

##                AN     Name
##  1:   APILCASERNB Example1
##  2:   APILCASELNB Example1
##  3: APILCASEYHANB Example1
##  4:       A15DPGY Example2
##  5:       A15DPRD Example2
##  6:       A15DPWH Example2
##  7:       A15DPDB Example2
##  8:       A15DPYW Example2
##  9:       A15DPTL Example2
## 10:     A15DP4PGY Example3
## 11:     A15DP4PRD Example3
## 12:     A15DP4PWH Example3
## 13:     A15DP4PDB Example3
## 14:     A15DP4PYW Example3
## 15:     A15DP4PTL Example3


# smaller exmaple ------------------------------------

dt.ex <- dt[Name == unique(Name)[1]]
dt.ex

##               AN     Name
## 1:   APILCASERNB Example1
## 2:   APILCASELNB Example1
## 3: APILCASEYHANB Example1

get.lcs.vector(dt.ex$AN)

## [1] "APILCASE"

# you can also start from end like this
stri_reverse(get.lcs.vector(stri_reverse(dt.ex$AN)))



# Example on all data.table ------------------------------------

dt[, AN2 := get.lcs.vector(AN), Name]
dt

##                AN     Name      AN2
##  1:   APILCASERNB Example1 APILCASE
##  2:   APILCASELNB Example1 APILCASE
##  3: APILCASEYHANB Example1 APILCASE
##  4:       A15DPGY Example2    A15DP
##  5:       A15DPRD Example2    A15DP
##  6:       A15DPWH Example2    A15DP
##  7:       A15DPDB Example2    A15DP
##  8:       A15DPYW Example2    A15DP
##  9:       A15DPTL Example2    A15DP
## 10:     A15DP4PGY Example3  A15DP4P
## 11:     A15DP4PRD Example3  A15DP4P
## 12:     A15DP4PWH Example3  A15DP4P
## 13:     A15DP4PDB Example3  A15DP4P
## 14:     A15DP4PYW Example3  A15DP4P
## 15:     A15DP4PTL Example3  A15DP4P

longest common substring in R finding non-contiguous matches between the two strings

7 Answers7

Linked

Related