I have a function that returns the longest common substring from two strings:
longest.substring <- function(a, b)
{
    A <- strsplit(a, "")[[1]]
    B <- strsplit(b, "")[[1]]
    L <- matrix(0, length(A), length(B))
    ones <- which(outer(A, B, "=="), arr.ind = TRUE)
    ones <- ones[order(ones[, 1]), ]
    if(length(ones) != 0){
        for(i in 1:nrow(ones)) {
            v <- ones[i, , drop = FALSE]
            L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
        }
        paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
    }
}
longest.substring("hello world","hella old") #returns "hell"
longest.substring("abc","def") #returns nothing
I originally found this function in Identify a common pattern and added the if clause to deal with strings that have no matching characters at all. It works fine, as the examples above show, but I have a problem applying it to my dataset. For each row, I want to call this function on the values of two columns and store the result in a third column. I tried a few approaches, for example:
table1$LCS <- mapply(longest.substring, table1$col1, table1$col2)
table1$LCS <- apply(table1[,c("col1","col2")], 1, function(x)
longest.substring(x["col1"],x["col2"]))
Both approaches return an error (even though I use mapply to run adist on the same two columns and that works fine):
Error in 1:nrow(ones) : argument of length 0
When I tested the function on just two strings, this is exactly the error I got before I added the if clause, so the function seems to 'skip' this check and still tries to run the for loop, which causes the error.
I would also like to note that my dataset is quite large (several thousand rows), so I think a for loop will take ages to complete.
EDIT: I tried a for loop too, but it returns the same error as above.
for (i in 1:nrow(Adresy_baza_match)){
Adresy_baza_match[i,"LCS"] <- longest.substring(Adresy_baza_match[i,4], Adresy_baza_match[i,5])
}
EDIT: I managed to isolate which row causes the error:
    a         b
921 BRUSKIEGO PLATYNOWA
922 BRUSKIEGO BPAHIERONIMAROZRAŻEWSKIEGO
923 BRUSKIEGO BPAKONSTANTYNADOMINIKA
The first row seems to cause it:
x <-longest.substring("BRUSKIEGO", "PLATYNOWA")
In this case, running the function body line by line shows that length(ones) is 2 while nrow(ones) returns NULL. From my other tests, this happens every time there is only one matching substring and it consists of a single character.
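The length/nrow mismatch can be reproduced in isolation. It is consistent with R's default drop = TRUE when a matrix is subset with `[`: a one-row result collapses to a plain vector, so nrow() returns NULL and 1:nrow(ones) fails. The following is only a minimal sketch, assuming that is the cause. The name longest_substring_fixed and the NA_character_ return value for "no match" are hypothetical choices made for illustration and are not part of the original function.

```r
# 1) Reproduce the symptom with the failing pair.
A <- strsplit("BRUSKIEGO", "")[[1]]
B <- strsplit("PLATYNOWA", "")[[1]]
ones <- which(outer(A, B, "=="), arr.ind = TRUE)    # 1 x 2 matrix: only "O" matches

dropped <- ones[order(ones[, 1]), ]                 # default drop = TRUE -> plain vector
kept    <- ones[order(ones[, 1]), , drop = FALSE]   # still a 1 x 2 matrix

length(dropped)  # 2
nrow(dropped)    # NULL, so 1:nrow(dropped) raises "argument of length 0"
nrow(kept)       # 1

# 2) Hypothetical variant of the function that keeps `ones` a matrix.
longest_substring_fixed <- function(a, b) {
  A <- strsplit(a, "")[[1]]
  B <- strsplit(b, "")[[1]]
  L <- matrix(0, length(A), length(B))
  ones <- which(outer(A, B, "=="), arr.ind = TRUE)
  ones <- ones[order(ones[, 1]), , drop = FALSE]    # never collapse to a vector
  for (i in seq_len(nrow(ones))) {                  # seq_len() is safe for 0 rows
    v <- ones[i, , drop = FALSE]
    L[v] <- if (any(v == 1)) 1 else L[v - 1] + 1
  }
  len <- max(L)
  if (len == 0) return(NA_character_)               # assumption: NA means "no match"
  end <- which(L == len, arr.ind = TRUE)[1, 1]      # row index where the match ends
  paste0(A[(end - len + 1):end], collapse = "")
}

# Applied element-wise, as in the mapply attempt above:
mapply(longest_substring_fixed,
       c("hello world", "BRUSKIEGO", "abc"),
       c("hella old",   "PLATYNOWA", "def"),
       USE.NAMES = FALSE)
```

Returning NA_character_ instead of NULL is a deliberate choice here: it lets mapply simplify the result to a character vector that can be assigned straight to a data frame column.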