2

Is there a (easy) possibility to identify a common pattern which two strings share? Here is a little example to make clear what I mean:

I have two variables containing a string. Both include the same pattern ("ABC") and also some "noise".

a <- "xxxxxxxxxxxABCxxxxxxxxxxxx"
b <- "yyyyyyyyyyyyyyyyyyyyyyyABC"

Lets say I don't know the common pattern and I want R to find out that both strings contain "ABC". How can I do this?

*edit

The first example was maybe a bit to simplistic. Here is a example from my real data.

a <- "DUISBURG-HAMBORNS"
b <- "DUISBURG (-31.7.29)S"

Both strings contain "DUISBURG" which I want the function to identify.

*edit

I took the solution proposed in the link posted in the comments. But I still have not exactly what I want.

library(qualV)
LCS(strsplit(a[1], '')[[1]],strsplit(b[1], '')[[1]])$LCS

[1] "D" "U" "I" "S" "B" "U" "R" "G" "-" " " " " "S"

If the function is looking for the longest common subsequence of the two vectors, why does it not stop after "D" "U" "I" "S" "B" "U" "R" "G"? .

Alex
  • 4,925
  • 2
  • 32
  • 48

1 Answers1

7

Function LCS from qualV package (in Find common substrings between two character variables; not a possible duplicate) does something else than what you need. It solves the longest common subsequence problem, where subsequences are not required to occupy consecutive positions within the original sequences.

What you have is the longest common substring problem, for which you could use this algorithm, and here is the code assuming that there is a unique (in terms of length) longest common substring:

a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"

A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]

L <- matrix(0, length(A), length(B))
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), ]
for(i in 1:nrow(ones)) {
  v <- ones[i, , drop = FALSE]
  L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
# [1] "DUISBURG"
Community
  • 1
  • 1
Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
  • I found this function as an almost perfect solution to a similar problem I have, though I have one issue. How to make it handle cases where there is no commong substring? Specifically, I don't care about the exact substring but its length, so far I run this one and then use nchar, but the function returns error, when two strings don't have any matches in. – PrzeM Jan 15 '18 at 09:15
  • I have found that in abovementioned case `ones` is empty, so an if clause after it solved it. Another issue I have is how to apply it properly to my dataset? I need to compare strings from two columns row-by-row in the entire table. – PrzeM Jan 15 '18 at 09:44
  • @PrzeM, you can always run a `for` loop. Also, you could define a function `fun` that takes a vector of length 2 as an argument (first element would be `a`, second `b`) and do `apply(df, 1, fun)`. – Julius Vainora Jan 15 '18 at 12:44
  • The dataset is quite large (a few hundred thousand rows), so I would prefer to avoid loops because of performance issues. I have created a separate question with my code and details of the problem: https://stackoverflow.com/questions/48263456/applying-a-function-of-two-columns-over-each-row-of-data-frame – PrzeM Jan 15 '18 at 12:56
  • 1
    @PrzeM, my comment answers your question; since you care about efficiency I suggest you mention it there as well. – Julius Vainora Jan 15 '18 at 13:10
  • I have just tried implementing `for` loop, but it returns the same error as `apply`. – PrzeM Jan 15 '18 at 13:16