I have a function that returns the longest common substring of two strings:

longest.substring <-function(a,b)
{
  A <- strsplit(a, "")[[1]]
  B <- strsplit(b, "")[[1]]

  L <- matrix(0, length(A), length(B))
  ones <- which(outer(A, B, "=="), arr.ind = TRUE)
  ones <- ones[order(ones[, 1]), ]
  if(length(ones)!=0){
    for(i in 1:nrow(ones)) {
      v <- ones[i, , drop = FALSE]
      L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
    }
    paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
  }
}

longest.substring("hello world","hella old") #returns "hell"
longest.substring("abc","def") #returns nothing

The function was originally found in "Identify a common pattern"; I added the if clause to handle strings that share no substring at all. It works fine, as the examples in the code show, but I have a problem applying it to my dataset. For each row, I want to apply this function to the values of two columns and store the result in a third column. I tried a few approaches, for example:

table1$LCS <- mapply(longest.substring, table1$col1, table1$col2)
table1$LCS <- apply(table1[,c("col1","col2")], 1, function(x)
                    longest.substring(x["col1"],x["col2"]))

Both approaches return an error (mapply itself works fine when I use it to run adist on these same columns):

Error in 1:nrow(ones) : argument of length 0

From testing on just two strings, this is exactly what happened before I added the if clause, so it seems the function 'skips' the clause and tries to run the for loop, which causes the error.

Also, I would like to note that my dataset is quite large (several thousand rows), so I think a for loop would take ages to complete.

EDIT: I tried a for loop too, but it returns the same error as above.

for (i in 1:nrow(Adresy_baza_match)){
  Adresy_baza_match[i,"LCS"] <- longest.substring(Adresy_baza_match[i,4], Adresy_baza_match[i,5])
}

EDIT: I managed to isolate the row that causes the error:

            a                          b
921 BRUSKIEGO                  PLATYNOWA
922 BRUSKIEGO BPAHIERONIMAROZRAŻEWSKIEGO
923 BRUSKIEGO     BPAKONSTANTYNADOMINIKA

The first row seems to cause it:

x <-longest.substring("BRUSKIEGO", "PLATYNOWA")

In this case (running the function code line by line), length(ones) is 2 while nrow(ones) returns NULL. From my other attempts, this happens every time there is only one match and it consists of a single character.

PrzeM

2 Answers

A couple of points:

  1. This line in the code in the question:

    ones <- ones[order(ones[, 1]), ] 
    

    should be the following, so that a single match stays a one-row matrix instead of collapsing to a vector:

    ones <- ones[order(ones[, 1]), , drop = FALSE]
    
  2. Define longest.substring.vec, which is like longest.substring except that it accepts vectors a and b rather than just scalars. It also coerces its arguments to character and replaces NULL with NA to ensure that the result is a character vector and not a list. Now try this:

    longest.substring.vec <- function(a, b, default = NA_character_, 
             USE.NAMES = FALSE) {
      a <- as.character(a)
      b <- as.character(b)
      m <- mapply(longest.substring, a, b, USE.NAMES = USE.NAMES)
      m[lengths(m) == 0] <- default
      unlist(m)
    }
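
The effect of point 1 can be seen in isolation. The sketch below is only an illustration: the one-row matrix and the names `dropped` and `kept` are made up here to mimic the single "O" match from the question's problematic row.

```r
# A one-row index matrix, shaped like the result of
# which(..., arr.ind = TRUE) when there is exactly one match.
ones <- matrix(c(9L, 7L), nrow = 1, dimnames = list(NULL, c("row", "col")))

dropped <- ones[order(ones[, 1]), ]                # collapses to a plain vector
kept    <- ones[order(ones[, 1]), , drop = FALSE]  # stays a 1 x 2 matrix

length(dropped)         # 2
is.null(nrow(dropped))  # TRUE, so 1:nrow(dropped) has nothing to iterate over
nrow(kept)              # 1
```

With the dropped version, length(ones) is non-zero, so the if clause is entered, but nrow(ones) is NULL and 1:nrow(ones) fails with "argument of length 0".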
    

To test out these two changes:

tab <- data.frame(a = c("hello, world", "abc"), b = c("hella old", "def"))
transform(tab, c = longest.substring.vec(a, b))
##              a         b    c
## 1 hello, world hella old hell
## 2          abc       def <NA>
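
For completeness, here is a self-contained sketch that puts both changes together and runs them on the row reported as problematic in the question. The data frame `tab2`, the column name `LCS`, and the variable `res` are names chosen for this illustration only.

```r
# Function from the question, with drop = FALSE added (point 1).
longest.substring <- function(a, b) {
  A <- strsplit(a, "")[[1]]
  B <- strsplit(b, "")[[1]]

  L <- matrix(0, length(A), length(B))
  ones <- which(outer(A, B, "=="), arr.ind = TRUE)
  # drop = FALSE keeps a single match as a 1 x 2 matrix
  ones <- ones[order(ones[, 1]), , drop = FALSE]
  if (length(ones) != 0) {
    for (i in 1:nrow(ones)) {
      v <- ones[i, , drop = FALSE]
      L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
    }
    paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
  }
}

# Vectorised wrapper (point 2): NULL results become `default`.
longest.substring.vec <- function(a, b, default = NA_character_,
                                  USE.NAMES = FALSE) {
  a <- as.character(a)
  b <- as.character(b)
  m <- mapply(longest.substring, a, b, USE.NAMES = USE.NAMES)
  m[lengths(m) == 0] <- default
  unlist(m)
}

# Illustrative data: a normal pair, the single-character match, and no match.
tab2 <- data.frame(a = c("hello world", "BRUSKIEGO", "abc"),
                   b = c("hella old", "PLATYNOWA", "def"))
res <- longest.substring.vec(tab2$a, tab2$b)
res
## [1] "hell" "O"    NA

# Columns are referred to by bare name inside transform()
transform(tab2, LCS = longest.substring.vec(a, b))
```

The second row is the case that used to fail: the only common character is "O", so the index matrix has a single row.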

Update:

Added point 1. Rearranged presentation.

G. Grothendieck
  • I tried this (your code works fine), but when I use this on my dataset (replacing tab with my data frame reduced to two columns), it returns NAs for every row. – PrzeM Jan 15 '18 at 13:28
  • In the meantime I was trying to extract a suitable data sample, but another strange thing happens: when I do `tab <- as.data.frame(mytable[,c("col1","col2")])` and run your code on it, it returns only NAs for each row, but when I manually create a data table with the values copied and pasted from my dataset, it works normally. This is the manual entry table: `tab <- data.frame(a = c("ZIELONYTRÓJKĄT", "KOPERNIKA", "TRAKTŚWWOJCIECHA2994","POBIEDZISKA"), b = c("TRAKTŚWWOJCIECHA", "TRAKTŚWWOJCIECHA", "TRAKTŚWWOJCIECHA", "TRAKTŚWWOJCIECHA"))` – PrzeM Jan 15 '18 at 13:34
  • Could you clarify how I should refer to the column names in `transform`? I have just rewritten the code and named my two columns "a" and "b" as in your example, and now I am getting the same length-0 error as in the beginning. Another observation: the manually created table (which works) consists of factors, with `typeof(tab$a)` returning `integer`, while the subset cut from my dataset has `typeof(tab$a)` returning `character`. – PrzeM Jan 15 '18 at 13:41
  • You will need to provide a reproducible example that shows the input data and the specific command you are running to generate the error. The `tab` data frame you defined in the comment, when run with the `longest.substring` function in the question and the `longest.substring.vec` function and `transform` statement in my response, does not exhibit any errors. – G. Grothendieck Jan 15 '18 at 13:47
  • Ok, I found out the row causing trouble, added to the opening post. – PrzeM Jan 15 '18 at 14:11
  • Thanks, the error no longer occurs. Can you also advise how to properly address the columns in `transform`? The dataset has 6 columns, but I want to apply this function on only 2 of them, preferably calling them by their names instead of numbers. – PrzeM Jan 15 '18 at 14:42
  • Never mind, I had some problems earlier, but now when I put the column names (without " or ') as arguments of the function, it works fine. – PrzeM Jan 15 '18 at 14:55
  • The `transform` statement shown in the answer will create a `c` column and return the original data frame along with that new column. – G. Grothendieck Jan 15 '18 at 14:55

There's an easier and more robust solution using the GrpString package.

library(dplyr)  # provides %>%, filter() and select()

s <- c("hello world", "hello old", "hello")

GrpString::CommonPatt(s) %>%
  filter(Freq_str == length(s)) %>%
  filter(Length == max(Length)) %>%
  select(Pattern) %>%
  unlist(use.names = FALSE)

Check the output of GrpString::CommonPatt(s) for more information on common patterns.

ishonest