Find common substrings between two character variables

Question

I have two character variables (names of objects) and I want to extract the largest common substring.

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')

I want the following as a result:

[1] "ABC" "DEF"

These vectors as input should give the same result:

a <- c('textABCxx', 'textDEFxx')
b <- c('zzABCblah', 'zzDEFblah')

These examples are representative. The strings contain identifying elements, and the remainder of the text in each vector element is common, but unknown.

Is there a solution, in one of the following places (in order of preference):

Base R
Recommended Packages
Packages available on CRAN

The answer to the supposed-duplicate does not fulfill these requirements.

Check out this question: http://stackoverflow.com/questions/1429476/longest-common-substring-problem — dave, Apr 24 '13 at 15:42
Also this: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/68013.html — Maxim.K, Apr 24 '13 at 15:45
http://svitsrv25.epfl.ch/R-doc/library/Biostrings/html/pmatchPattern.html, and this http://www.emoticode.net/r/longest-common-substring.html — Maxim.K, Apr 24 '13 at 15:46
I'm looking for a base function, or something in a recommended package (available on CRAN). — Matthew Lundberg, Apr 24 '13 at 15:57
@MatthewLundberg It doesn't look that it exists, unless Google chose to hide it. — Maxim.K, Apr 24 '13 at 16:11
Is your objection to the BSD license attached to Rlibstree (at the link in the first question above)? Are you OK with anything on CRAN, or just the [recommended packages](http://cran.r-project.org/doc/FAQ/R-FAQ.html#Add_002don-packages-from-CRAN)? — Josh O'Brien, Apr 24 '13 at 16:12
@JoshO'Brien, I can't find Rlibstree. Were you able to install it? — GSee, Apr 24 '13 at 16:13
Ok, I found it [here](http://bioconductor.org/packages/2.4/extra/html/Rlibstree.html), but I can't get the desired output with it. I'd love to upvote an answer that can. — GSee, Apr 24 '13 at 16:23
@GSee -- I have a slow download right now, but am interested too, and will investigate once I get the Bioconductor version installed (and all associated packages updated). — Josh O'Brien, Apr 24 '13 at 16:25
@MatthewLundberg, what you've is the *longest uncommon substring*, isn't it? For your first example, LCS is `blah`. — Arun, Apr 24 '13 at 16:33
The funny thing is that I looked at this and said, "hey, if those were numbers, this would be a correlation integral." Does `libstree`'s algorithm do anything similar? — Carl Witthoft, Apr 24 '13 at 17:22
@CarlWitthoft, If you weren't joking (sorry if so, for not getting the humor), then [**check this out**](http://en.wikipedia.org/wiki/Longest_common_substring) for what libstree does :). — Arun, Apr 24 '13 at 17:41
@Arun Thanks for the pointer. I was actually wondering about moving one string w.r.t. the other, and looking for maximal overlap, but realized that wouldn't give much useful information. — Carl Witthoft, Apr 24 '13 at 19:08

eddi · Accepted Answer · 2015-10-28T16:14:33.453

10

Here's a CRAN package for that:

library(qualV)

sapply(seq_along(a), function(i)
    paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
          collapse = ""))

edited Oct 28 '15 at 16:14

answered Apr 24 '13 at 17:18

eddi

49,088
6
104
155

1

(+1) for CRAN package. I find the input to this function *really* weird though. I mean, it's obvious that one has to create all suffixes. Why ask the user to do it? Nice thing is they've their own implementation of LCS in C. – Arun Apr 24 '13 at 17:38
it has to do with how the rest of the package works - most of the processing is done on vectors – eddi Apr 24 '13 at 17:53
I don't mind the function taking vector inputs. But it should just be `LCS(a[i], b[i])` is what I meant. We shouldn't be generating suffixes. – Arun Apr 24 '13 at 17:56
1

@Arun, yes, I understand what you were saying and what I'm saying is that the package is centered around processing of vectors of numbers and this function is meant to be LCSequence, which is a generalization of LCSubstring. It would of course be nice if someone wrote the specialized version for strings, but oh well. – eddi Apr 24 '13 at 18:15
This is not working for me. Here is my code: ` > library(qualV)` ` > a <- "* LEAR ROBINSO"` ` > b <- "* STAR CITY GA"` ` > c <- LCS(substring(a, seq(1, nchar(a)), seq(1, nchar(a))), substring(b,seq(1, nchar(b)), seq(1, nchar(b))))` ` > c$LCS` ` [1] "*" " " "A" "R" " " "I"` ` > c <- paste(LCS(substring(a, seq(1, nchar(a)), seq(1, nchar(a))), substring(b, seq(1, nchar(b)), seq(1, nchar(b))))$LCS,collapse="")` ` > c` ` [1] "* AR I"` – ConfusedMan Apr 08 '15 at 02:14
1

@ConfusedMan does this link describe your issue: http://stackoverflow.com/q/28261825/1175496 ? Were you expecting an output of only `'AR '` rather than `'AR I'`? – Nate Anderson Oct 28 '15 at 05:02
2

Also, I I think it's easier to replace `substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i])))` with `strsplit(a[i],'')[[1]]`. At first I just thought this was easier to read. But [this site] suggests `strsplit(...)` is also **more efficient** – Nate Anderson Oct 28 '15 at 06:52
1

Finally, as long as LCS is designed to accept vectors, here is a translation of strings to vectors: `function(a,b){ av <- strsplit(a,'')[[1]]; bv <- strsplit(b,'')[[1]]; return(paste(LCS(av,bv)$LCS, collapse='')); }` – Nate Anderson Oct 28 '15 at 07:58
@TheRedPea Understood. Thank you! – ConfusedMan Oct 28 '15 at 14:33
Sorry the link which says strsplit is [more efficient](https://stat.ethz.ch/R-manual/R-devel/library/base/html/substr.html) `## strsplit is more efficient ...` – Nate Anderson Oct 28 '15 at 16:16

score 9 · Answer 2 · answered Apr 24 '13 at 16:49

If you dont mind using bioconductor packages, then, You can use Rlibstree. The installation is pretty straightforward.

source("http://bioconductor.org/biocLite.R")
biocLite("Rlibstree")

Then, you can do:

require(Rlibstree)
ll <- list(a,b)
lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), 
           function(x) getLongestCommonSubstring(x))

# $X1
# [1] "ABC"

# $X2
# [1] "DEF"

On a side note: I'm not quite sure if Rlibstree uses libstree 0.42 or libstree 0.43. Both libraries are present in the source package. I remember running into a memory leak (and hence an error) on a huge array in perl that was using libstree 0.42. Just a heads up.

score 0 · Answer 3 · answered Apr 25 '13 at 18:40

Because I have too many things I don't want to do, I did this instead:

Rgames> for(jj in 1:100) {
+ str2<-sample(letters,100,rep=TRUE)
+ str1<-sample(letters,100,rep=TRUE)
+ longs[jj]<-length(lcstring(str1,str2)[[1]])
+ }
Rgames> table(longs)
longs
 2  3  4 
59 39  2

Anyone care to do a statistical estimate of the actual distribution of matching strings? (lcstring is just a brute-force home-rolled function; the output contains all max strings which is why I only look at the first list element)

Find common substrings between two character variables

3 Answers3

Linked

Related