R implementation for Finding the longest common starting substrings in a set of strings

Question

This question is just asking for an implementation in R of the following question : Find the longest common starting substring in a set of strings (JavaScript)

"This problem is a more specific case of the Longest common substring problem. I need to only find the longest common starting substring in an array".

So im just looking an R implementation for this question (preferably not in a for / while loop fashion that was suggested in the JavaScript version), if possible i would like to wrap it up as a function, so i could apply on many groups in a data table.

After some searches, i couldn't find an R example for this, hence this question.

Example Data: I have the following vector of characters:

dput(data)
c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")

I'm looking to run an algorithm in R that will find the following output: ADA44.

From what I've seen in the JavaScript accepted answer, the idea is to first sort the vector, extract the first and last elements (for example : "ADA4417-3ARMZ-R7" and "ADA4432-1BRJZ-R2" , break them into single characters, and loop through them until one of the characters don't match (hope im right)

Any Help on that would be great!

Cath · Accepted Answer · 2015-02-02T09:09:39.930

12

Taking inspiration from what you suggested, you can try this function :

comsub<-function(x) {
    # sort the vector
    x<-sort(x)
    # split the first and last element by character
    d_x<-strsplit(x[c(1,length(x))],"")
    # compute the cumulative sum of common elements
    cs_x<-cumsum(d_x[[1]]==d_x[[2]])
    # check if there is at least one common element
    if(cs_x[1]!=0) {
        # see when it stops incrementing and get the position of last common element
        der_com<-which(diff(cs_x)==0)[1]
        # return the common part
        return(substr(x[1],1,der_com))
    } else { # else, return an empty vector
        return(character(0))
    }
}

UPDATE

Following @nicola suggestion, a simpler and more elegant variant for the function:

comsub<-function(x) {
    # sort the vector
    x<-sort(x)
    # split the first and last element by character
    d_x<-strsplit(x[c(1,length(x))],"")
    # search for the first not common element and so, get the last matching one
    der_com<-match(FALSE,do.call("==",d_x))-1
    # if there is no matching element, return an empty vector, else return the common part
    ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}

Examples:

With your data

x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"

When there is no common starting substring

x<-c("abc","def")
> comsub(x)
# character(0)

edited Feb 02 '15 at 09:09

answered Feb 02 '15 at 08:41

Cath

23,906
5
52
86

3

Nice answer. I'd change the third line in `der_com<-match(FALSE,do.call("==",d_x))-1`, maybe more elegant and efficient. – nicola Feb 02 '15 at 08:49
1

@YehoshaphatSchellekens, you're welcome, however, I just edited my answer to modify the function because the former one didn't behave that good when there was no common starting substring... I'll do some other tests to make sure everything always goes well... – Cath Feb 02 '15 at 08:54
1

@nicola, thanks for the suggestion, it indeed is more elegant ! (I think it also might help simplify the function when there is no common starting part !). Thanks again, I'll edit my function – Cath Feb 02 '15 at 08:55
1

@YehoshaphatSchellekens, see the edited functions. Now you have two variants of a function that should give you the longest common starting substring for (hopefully) any given character vector – Cath Feb 02 '15 at 09:08
`sort` is quite expensive (on a large set). `range` would be a good alternative for just plucking out the min and the max, which is all that is needed. – dsz May 03 '22 at 06:11

score 5 · Answer 2 · answered Feb 02 '15 at 09:44

5

A non-base alternative, using the lcprefix function in Biostrings to find the "Longest Common Prefix [...] of two strings"

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)

x2 <- sort(x)
substr(x2[1], start = 1, stop = lcprefix(x2[1], x2[length(x2)]))
# [1] "ADA44"

answered Feb 02 '15 at 09:44

Henrik

65,555
14
143
159

thanks @Henrik, is there any reason to believe that this would be faster than the base version? – Yehoshaphat Schellekens Feb 02 '15 at 09:47
1

@YehoshaphatSchellekens The wording on the [Biostrings home page](http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html) sounds promising: "Memory efficient string containers, string matching algorithms, and other utilities, for fast manipulation of large biological sequences or sets of sequences.", but I think you need to do some benchmarking. Also note the warning on `?lcprefix`. I doubt `substr` is the bottleneck here but you _may_ cut some microseconds using `stringi::stri_sub`. – Henrik Feb 02 '15 at 09:52

score 2 · Answer 3 · answered Nov 16 '15 at 13:53

Piggybacking off Henrik's answer, Bioconductor has a C based prefix function and an R based one. The R based one is:

lcPrefix <- function (x, ignore.case = FALSE) 
{
    x <- as.character(x)
    if (ignore.case) 
        x <- toupper(x)
    nc <- nchar(x, type = "char")
    for (i in 1:min(nc)) {
        ss <- substr(x, 1, i)
        if (any(ss != ss[1])) {
            return(substr(x[1], 1, i - 1))
        }
    }
    substr(x[1], 1, i)
}
<environment: namespace:Biobase>

... and doesn't require any special features of Bioconductor (as far as I can tell).

--- Citation ---

Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods,
2015:12, 115.

score 1 · Answer 4 · answered Nov 16 '15 at 14:38

1

Here is a compact solution:

data<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", "ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")

substr(data[1],1,which.max(apply(do.call(rbind,lapply(strsplit(data,''),`length<-`,nchar(data[1]))),2,function(i)!length(unique(i))==1))-1)
[1] "ADA44"

answered Nov 16 '15 at 14:38

Catherine Smith

175
14

This line of code is terrible and terrific at the same time, but thanks for sharing! :) – Ni-Ar Nov 20 '21 at 18:02

R implementation for Finding the longest common starting substrings in a set of strings

4 Answers4

Linked