`r`: Efficiently identifying identical parts of two strings w/o prior knowledge of the pattern?

Question

I'm looking for an efficient solution to the following problem:

a <- "TestStringA"
b <- "TeststringB"
magical_string_processing(a, b)
> [1] "Test" "tring"

In essence: How do I detect identical parts of strings without knowing them pre hoc? How to do this operating from known patterns and using regular expressions is easy, but without the knowledge?

<--EDIT--> Answers have been given and I'll explore them, but in the meantime stackoverflow tags gave me nomenclature pointers and I explored e.g. https://stackoverflow.com/a/50705861/2103880, which leads to

a <- "TestStringA"
b <- c(paste(LETTERS, collapse = ""), "TeststringB")
stringdist::amatch(a,b, method = "lcs", maxDist = Inf)
[1] 2

The matching string is thus nicely identified, but the actual substring not extracted ...

Thank you for any pointers.

It's easy to get the identical part of `Testtring` but to take Test as one word and StringA as another, probably impossible without some sort of dictionary — Sotos, Dec 31 '19 at 09:28
Can you include few more examples of input and expected output? How do you define "identical parts"? — Ronak Shah, Dec 31 '19 at 09:41
@RonakShah - The original was indeed ambiguous and I tried to fix this by an edit. Identity would be a one-to-one character match (one could govern this by an additional `ignore_case` flag, but that's window dressing). — balin, Dec 31 '19 at 09:57

score 1 · Accepted Answer · 2019-12-31T17:06:40.247

1

You can use adist and pull the transformation sequence. Then just locate where in the string that is, and extract the substring(s).

library(stringr)

a <- "TestStringA"
b <- "TeststringB"

magical_string_processing <- function(a, b) {

  match_pattern <- c(attr(adist(a, b, counts = TRUE), "trafos"))
  pattern_locate <- str_locate_all(match_pattern, "M+")[[1]]
  str_sub(b, pattern_locate[,1], pattern_locate[,2])

}

> magical_string_processing(a, b)
[1] "Test"  "tring"

This seems to fail with excessive whitespace. In that case, I find adding this into the function helps. If you need the whitespace, then I don't think this solution works great without some more tinkering. The issue seems to be with deciphering the "trafos" results, maybe with the deletions ("D").

a <- str_squish(a)
b <- str_squish(b)

edited Dec 31 '19 at 17:06

answered Dec 31 '19 at 14:10

NICE! THis is along the lines of what I'm looking for - I'll report the final result. Note that `stringr::str_squish` implies a trimming of terminal white space and the `stringr::str_trim` thus appears superfluous. – balin Dec 31 '19 at 16:55
1

For `a <- "TestStringA1testes 45321"` and `b <- "TeststringB2ees 12345"` this returns `[1] "Test" "tring" "e" "123" "" `, with or without `str_squish()`. My implementation returns `[1] "Test" "tring" "es" "45"`. I suppose it depends on what output one expects. – Radim Jan 01 '20 at 18:06
Yeah I was seeing weird things too. I think it comes down to needing a bit more sophistication with interpreting the trafos results. – Jan 01 '20 at 18:54
Is there an opposite of this magical_string_processing(a,b) that will take the non-marching paert of both strings? – stats_noob Jul 06 '22 at 20:23

Allan Cameron · Answer 2 · 2019-12-31T12:33:42.177

0

Maybe this?

magical_string_processing <- function(a, b)
{
  result <- character();
  broken_a <- unlist(strsplit(a, ""))
  broken_b <- unlist(strsplit(b, ""))
  stop_at <- min(c(length(broken_a), length(broken_b)))
  temp_word <- character();
  for(i in 1:stop_at)
  {
    if(broken_a[i] == broken_b[i]) temp_word <- c(temp_word, broken_a[i])
    else if(length(temp_word) > 0)
    {
      result <- c(result, paste0(temp_word, collapse = ""))
      temp_word <- character()
    }
  }
  return(result)
}

a <- "TestStringA"
b <- "TeststringB"
magical_string_processing(a, b)
#> [1] "Test" "tring"

edited Dec 31 '19 at 12:33

answered Dec 31 '19 at 11:01

Allan Cameron

147,086
7
49
87

Nice looping, but this will not generalize to strings of unequal length, try it with `a <- "TestStringA1test"` and `b <- "TeststringB2es"`. – Radim Dec 31 '19 at 11:47
@Radim you are right - I meant to use min() instead of max() for stop at. Now updated. – Allan Cameron Dec 31 '19 at 12:33

Radim · Answer 3 · 2019-12-31T20:14:59.973

Task is to extract identical strings with a constraint on positionality:

a <- "TestStringA"
b <- "TeststringB"

Expected matches: "Test", "tring"

Solution:

    magical_string_processing <- function(stringA, stringB) {

# split strings into individual characters

    split_stringA <- unlist(strsplit(stringA, split =  ""))
    split_stringB <- unlist(strsplit(stringB, split =  ""))


    # pad vectors of unequal lengths with NAs

if (length(split_stringA) < length(split_stringB)) {
    identical_A <- c(split_stringA, rep(NA, 
                                      (length(split_stringB)-length(split_stringA))))

    identical_B <- split_stringB

} else { 
    identical_B <- c(split_stringB,rep(NA,
                                       (length(split_stringA)-length(split_stringB)))) 

    identical_A <- split_stringA
}


    # compare padded vectors
   compared <- identical_A == identical_B

   # remove NAs left from padding
   compared <- na.omit(compared)

   # use names attribute to preserve information about positions 
   names(compared) <- seq(1, length(compared))

   # split comparison based on TRUE/FALSE switches
   list_indices <- split(compared, cumsum(c(0,diff(compared)!=0)))

   # define function to replace logicals with positions
   pick_indices <-  function(vec) {

       replaced <- if (sum(vec)!=0) {

           # replace TRUEs with positions preserved in names 

           as.numeric(names(vec)) } else {

               # set FALSEs to NULL

                              NULL}
       return(replaced)

   }

   # apply the function to get a list of valid positions
   picker_list <- lapply(list_indices, pick_indices)

   # remove NULL matches
   picker_list_clean <- Filter(Negate(is.null), picker_list)


# define function to select matching characters from original strings
selector_helper <-  function(x, string) string[x]

# apply the helper function to either of the strings
temp_result <- lapply(picker_list_clean, selector_helper, split_stringA)

# collapse the matching characters back into strings
result <- sapply(temp_result, paste, collapse = "")

# can insert result <- trimws(result) if spaces should not be in the output

names(result) <- NULL

return(result)

}

Results:

a <- "TestStringA"
b <- "TeststringB"
magical_string_processing(a,b)
[1] "Test"  "tring"

Reverse string order:

magical_string_processing(b,a)
[1] "Test"  "tring"

Other examples:

a <- "Match1NOMATCH    match2test"
b <- "Match1nomatch    match2"
magical_string_processing(a,b)
[1] "Match1"     "    match2"

a <- "TestStringA1testes  eeeee"
b <- "TeststringB2ees   eeeee"
magical_string_processing(a,b)
[1] "Test"  "tring" "es"    "eee"

nice answer. However, when I use this function and input the two strings you gave in your comment to me above, `a <- "TestStringA1test"` and `b <- "TeststringB2es"` I get 3 matches: `"Test" "tring" "test"`. Surely the third "match" shouldn't be there? — Allan Cameron, Dec 31 '19 at 12:38
Nice indeed. The non-positionally constraint matching observed by @AllanCameron renders it not usable in my case, however. — balin, Dec 31 '19 at 13:10
@AllanCameron Yes, you are right, I realized that as soon as I commented on your solution. I made mine more robust... but also more complicated. Enough pointers for today :) — Radim, Dec 31 '19 at 13:39

score 0 · Answer 4 · answered Dec 31 '19 at 13:50

Removing the positionality constraint makes this much more complicated, but feasible:

magical_string_processing <- function(a, b)
{
  result <- character();
  broken_a <- unlist(strsplit(a, ""))
  broken_b <- unlist(strsplit(b, ""))
  list() -> all_matches
  for(i in 1:length(broken_a))
  {
    all_matches[[i]] <- which(broken_b == broken_a[i])
  }

  for(k in 1:(length(all_matches) - 1))
  {
    if(length(all_matches[[k]]) == 0) next
    for(i in seq_along(all_matches[[k]]))
    {
      char_matches <- all_matches[[k]]
      char_match <- char_matches[i]
      j <- k
      running_matches <- char_match
      for(j in (k + 1):length(all_matches))
      {
        next_row <- all_matches[[j]];
        if(length(next_row) == 0)
        {
          if(length(running_matches) > 1) result <- c(result, paste(broken_b[running_matches], collapse = ""))
          running_matches <- character()
          break
        }

        if(!(char_match + 1) %in% next_row)
        {
          if(length(running_matches) > 1) result <- c(result, paste(broken_b[running_matches], collapse = ""))
          running_matches <- character()
          break
        }

        running_matches <- c(running_matches, next_row[which(next_row == char_match + 1)])
        char_match <- running_matches[length(running_matches)]
        all_matches[[j]] <- all_matches[[j]][-which(next_row == char_match + 1)]
      }
    }
    if(length(running_matches > 0))
    {
       result <- c(result, paste(broken_b[running_matches], collapse = ""))
    }
  }
  return(result)
}

Now:

> a <- "  TestStringA"
> b <- "TeststringB"
> magical_string_processing(a, b)
# [1] "Test"  "tring"

This returns different results for `magical_string_processing(a, b)` and `magical_string_processing(b, a)` if defined as `a <- "TestStringA1testes"` and `b <- "TeststringB2ees"`. Depending on the task (not described in detail by OP), this may or may not be desirable. Anyway, solutions clearly exist :) — Radim, Dec 31 '19 at 13:59
Yes @Radim, it really depends on the requirements of the function. I think this solution could be tweaked to give the desired output quite easily though. — Allan Cameron, Dec 31 '19 at 14:06
@Radim For example, you could have a wrapper function that calls this function both ways round and returns only unique results. — Allan Cameron, Dec 31 '19 at 14:09

`r`: Efficiently identifying identical parts of two strings w/o prior knowledge of the pattern?

4 Answers4