5

do you know any ready-to-use method to obtain length and also overlap of two strings? However only with R, maybe something from stringr? I was looking here, unfortunately without succes.

str1 <- 'ABCDE'
str2 <- 'CDEFG'

str_overlap(str1, str2)
'CDE'

str_overlap_len(str1, str2)
3

Other examples:

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'

str_overlap(str1, str2)
'CCTG'

str_overlap_len(str1, str2)
4

///

str1 <- 'foobarandfoo'
str2 <- 'barand'

str_overlap(str1, str2)
'barand'

str_overlap_len(str1, str2)
6

/// Yes two solutions, always pick always overlap

str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'

str_overlap(str1, str2)
'ABCDE'

str_overlap_len(str1, str2)
5

I was wonder about homemade small function for this, such as this one?

Adamm
  • 2,150
  • 22
  • 30
  • What is the expected output when input is `str1 <- c("EFGABCDE", "ABCDECDE")` ? Please provide more examples to understand the expected output. – zx8754 Feb 09 '18 at 08:04
  • https://www.geeksforgeeks.org/longest-repeating-and-non-overlapping-substring/ – Tim Biegeleisen Feb 09 '18 at 08:04
  • 3
    Are you looking for the longest common substring? Check out https://stackoverflow.co/search?q=%5Br%5D+longest+common+substring – talat Feb 09 '18 at 08:16
  • Related to [this](https://stackoverflow.com/questions/35381180/identify-a-common-pattern) – akrun Feb 09 '18 at 08:17
  • Yes, you got me. However, I'd like to solve this problem by myself, and avoid dynamic programming (if it's possible) – Adamm Feb 09 '18 at 08:18
  • Or https://stackoverflow.com/questions/16196327/find-common-substrings-between-two-character-variables – talat Feb 09 '18 at 08:18
  • 3
    I don't see what you mean by doing it yourself and avoiding dynamic programming – talat Feb 09 '18 at 08:19
  • Actually, I'd like to write a small function, that creates shortest superstring from several strings. My idea was to find longest overlap between any two strings in dataset, then mege them. Next step find longest overlap between already meged string and next string from dataset, and so on... Anyway great thank for everyone who was interested in this problem and lost a bit of their time here. – Adamm Feb 09 '18 at 08:24
  • See [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) package. – zx8754 Feb 09 '18 at 08:37
  • This is a useful question - 2nd hit for a google search for "R string overlap". SO people can really vote against their own interests sometimes.. – geotheory Feb 18 '18 at 18:53
  • 1
    Here's my implementation (adapted from @gaurav-taneja's, and more efficient if string a is longer than b): `str_overlap = function(a, b) { if(nchar(a) > nchar(b)){ a0 = a; a = b; b = a0 }; for(n in seq(1, nchar(a))) { sb = unique(combn(strsplit(a, "")[[1]], n, FUN=paste, collapse="")); if(length(unlist(str_extract_all(b, sb))) == 0){ r = prior; return(r) }; prior = unlist(str_extract_all(b, sb)) }; prior }` – geotheory Feb 19 '18 at 21:32

2 Answers2

4

It seems to me that you (OP) are not very concerned with performance of the code but more interested in a potential approch to solve it without readymade functions. So here is an example I came up with to compute the longest common substring. I have to note that this only returns the first largest common substring found even when there can be several of the same length. This is something you could modify to fit your needs. And please don't expect this to be super fast - it won't.

foo <- function(str1, str2, ignore.case = FALSE, verbose = FALSE) {

  if(ignore.case) {
    str1 <- tolower(str1)
    str2 <- tolower(str2)
  }

  if(nchar(str1) < nchar(str2)) {
    x <- str2
    str2 <- str1
    str1 <- x
  }

  x <- strsplit(str2, "")[[1L]]
  n <- length(x)
  s <- sequence(seq_len(n))
  s <- split(s, cumsum(s == 1L))
  s <- rep(list(s), n)

  for(i in seq_along(s)) {
    s[[i]] <- lapply(s[[i]], function(x) {
      x <- x + (i-1L)
      x[x <= n]
    })
    s[[i]] <- unique(s[[i]])
  }

  s <- unlist(s, recursive = FALSE)
  s <- unique(s[order(-lengths(s))])

  i <- 1L
  len_s <- length(s)
  while(i < len_s) {
    lcs <- paste(x[s[[i]]], collapse = "")
    if(verbose) cat("now checking:", lcs, "\n")
    check <- grepl(lcs, str1, fixed = TRUE)
    if(check) {
      cat("the (first) longest common substring is:", lcs, "of length", nchar(lcs), "\n")
      break
    } else {
      i <- i + 1L 
    }
  }
}

str1 <- 'ABCDE'
str2 <- 'CDEFG'
foo(str1, str2)
# the (first) longest common substring is: CDE of length 3 

str1 <- 'ATTAGACCTG'
str2 <- 'CCTGCCGGAA'
foo(str1, str2)
# the (first) longest common substring is: CCTG of length 4

str1 <- 'foobarandfoo'
str2 <- 'barand'
foo(str1, str2)
# the (first) longest common substring is: barand of length 6 

str1 <- 'EFGABCDE'
str2 <- 'ABCDECDE'
foo(str1, str2)
# the (first) longest common substring is: ABCDE of length 5 


set.seed(2018)
str1 <- paste(sample(c(LETTERS, letters), 500, TRUE), collapse = "")
str2 <- paste(sample(c(LETTERS, letters), 250, TRUE), collapse = "")

foo(str1, str2, ignore.case = TRUE)
# the (first) longest common substring is: oba of length 3 

foo(str1, str2, ignore.case = FALSE)
# the (first) longest common substring is: Vh of length 2 
talat
  • 68,970
  • 21
  • 126
  • 157
  • 1
    Thanks for your effort. Currently I'm trying to create same function, I'll compare results. Of course, in this particular case performance isn't important at all. I just try to solve some boinformatic problems for my own practice and training. Cheers! – Adamm Feb 09 '18 at 09:28
1

Hope this helps:

library(stringr)

larsub<-function(x) {
  a<-x[1]
  b<-x[2]
  # get all forward substrings of a
  for(n in seq(1,nchar(a)))
    {
    sb<-unique(combn(strsplit(a, "")[[1]],n, FUN=paste, collapse=""))
    if(length(unlist(str_extract_all(b,sb)))==0){ 
      r<-prior
      return(r)
      }
    prior<-unlist(str_extract_all(b,sb))
    }

}

c1<-larsub(c('ABCD','BCDE'))
c2<-larsub(c('ABDFD','BCDE'))
c3<-larsub(c('CDEWQ','DEQ'))
c4<-larsub(c('BNEOYJBELMGY','BELM'))
print(c1)
print(c2)
print(c3)
print(c4)

Output:

> print(c1) [1] "BCD" > print(c2) [1] "B" "D" > print(c3) [1] "DEQ" > print(c4) [1] "BELM" `

Diclaimer: the logic was borrowed from the lcs answer here: longest common substring in R finding non-contiguous matches between the two strings posted by @Rick Scriven

Gaurav Taneja
  • 1,084
  • 1
  • 8
  • 19
  • You should not use combn(). In some case, it will return wrong result. e.g, larsub(c('XABCD','XBCDE')) #RETURN 'XBCD' . Instead, replace line "sb<-..." to sb<-dimnames(stringdist::qgrams(a,q=n))[[2]] – Ben2018 May 29 '22 at 17:13