Extract string differences including repeated characters

Question

I am trying to extract the difference between two strings, to see if anything was added between texts. However, the answers I could find all suggest using setdiff(), which does not count a character that occurs more than once.

Running:

list(setdiff(strsplit("what the hel","")[[1]],strsplit("what the h","")[[1]]))

returns

[[1]]
[1] "l"

whereas I would expect c("e", "l")

Is there a different function I should be using?

See this answer: https://stackoverflow.com/a/28834641/6461462 — M--, May 12 '23 at 20:33
@M-- the top answers don't work (try `Reduce(setdiff, strsplit(c("what the h", "what the hel"), split = ""))`), and the rest assume things like same-length. Perhaps this would be a good candidate for improving the other answers. — r2evans, May 12 '23 at 20:36
This only applies to strings of the same length, or relies on `setdiff()` — Adam_G, May 12 '23 at 20:36
@r2evans Not the top answer, the one I linked; it works: https://i.stack.imgur.com/bXx3U.png — M--, May 12 '23 at 20:37
@m-- that's fragile, and the warning is incomplete at saying "why": **recycling**. If `b` instead is `b <- "what the hwh"`, then it returns `character(0)`. In that case, it is silently wrong. I suggest that method needs added safeguards, which is why I posed my answer. (If there's another q/a that is better as a dupe, I'm not arguing that mine is awesome, just that the others I know of don't work for this question.) — r2evans, May 12 '23 at 20:45
@r2evans that's not right. It works for your example as well: https://i.stack.imgur.com/pv0hG.png The only issue would be when `b` is longer than `a`. In that case, it returns a bunch of `NA`s at the end. — M--, May 12 '23 at 20:48
@r2evans I actually think your answer is good (now that I tried it). But nonetheless this is a dupe. As I explained above. Closing this as a dupe is not a testament against your quality answer. Cheers. — M--, May 12 '23 at 20:54
The bottom line is that when one is longer, it recycles, and that's a logical flaw. I think you missed my point with that second image: `b` is two letters longer and those two letters are the same as `a`'s first two letters. Try the original, as in `a <- "what the h"; b <- "what the hwh"; a1 <- strsplit(a, "")[[1]]; b1 <- strsplit(b, "")[[1]]; b1[a1!=b1]`; then do `b <- "what the hwh"` and run the rest again. — r2evans, May 12 '23 at 22:00

r2evans · Answer 1 · 2023-05-13T13:26:31.103

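The problem with setdiff() is that it treats its inputs as sets: a character either occurs or it does not, so repeated characters are not counted. Here "e" already occurs in the shorter string, so it is dropped from the result even though the longer string has one more of them.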
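As a quick check, here is a minimal base R sketch of both effects: any character that occurs in the second vector is removed however often it appears in the first, and what remains is de-duplicated. The vectors x and y are made up for illustration.

```r
x <- c("h", "e", "l", "l", "e")   # illustrative input with repeated characters
y <- c("h", "e")

# Every "e" is dropped (it occurs in y) and the two "l"s collapse into one
setdiff(x, y)
# [1] "l"
```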

The solution in the linked duplicate is incomplete: if the strings have different lengths, it can return a false negative.

Using that code,

a <- "what the h"
b <- "what the hel"
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
s.b[s.b != s.a]
# Warning in s.b != s.a :
#   longer object length is not a multiple of shorter object length
# [1] "e" "l"

This result is correct, but what if instead b ended differently:

a <- "what the h"
b <- "what the hwh"
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
s.b[s.b != s.a]
# Warning in s.b != s.a :
#   longer object length is not a multiple of shorter object length
# character(0)

This incorrectly returns character(0) because R recycles s.a to the length of s.b. The length difference is two, and the first two letters of a ("wh") are the same as the last two letters of b, so the recycled comparison finds no differences.
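To make the recycling visible, the sketch below expands s.a by hand with rep_len(), which should mirror what the comparison does implicitly. The variable name recycled is introduced here only for illustration.

```r
a <- "what the h"
b <- "what the hwh"
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]

# Repeat s.a from its start until it is as long as s.b (12 characters)
recycled <- rep_len(s.a, length(s.b))
paste(recycled, collapse = "")
# [1] "what the hwh"

# Every position matches, so the != comparison is FALSE everywhere
any(s.b != recycled)
# [1] FALSE
```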

<rant> Recycling can be useful and a neat trick, but it causes problems often enough that in my opinion it should be an error, or at least something we can turn into an error via options. </rant>
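Until something like that exists, one option is to guard the comparison yourself. The sketch below uses a hypothetical helper, strict_neq(), that refuses to compare vectors of different lengths. It is an illustration and not part of base R. A blunter alternative is options(warn = 2), which turns every warning into an error, but R only warns when the longer length is not a multiple of the shorter one, so that setting would still miss some cases of recycling.

```r
a <- "what the h"
b <- "what the hwh"
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]

# Hypothetical helper: stop instead of recycling when lengths differ
strict_neq <- function(x, y) {
  if (length(x) != length(y)) {
    stop("lengths differ: ", length(x), " vs ", length(y))
  }
  x != y
}

# Capture the error message instead of aborting, to show what happens
res <- tryCatch(strict_neq(s.b, s.a), error = function(e) conditionMessage(e))
res
# [1] "lengths differ: 12 vs 10"
```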

The way around this is to compare the characters only up to the length of the shorter string, and then append whatever characters lie beyond that point.

If we aren't certain which is longer, a more complete (yet still admittedly crude) answer might be

a <- "what the h"
b <- "what the hel"
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
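common <- min(nchar(a), nchar(b))
i <- seq_len(common)                 # safe when common is 0, unlike 1:common
c(s.b[i][ s.b[i] != s.a[i] ],        # mismatches within the shared length
  s.a[seq_along(s.a) > common],      # leftover characters of a, if any
  s.b[seq_along(s.b) > common])      # leftover characters of b, if any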
# [1] "e" "l"

and the unlikely case in my counter-example above also works as one might expect:

a <- "what the h"
b <- "what the hwh"
s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
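common <- min(nchar(a), nchar(b))
i <- seq_len(common)                 # safe when common is 0, unlike 1:common
c(s.b[i][ s.b[i] != s.a[i] ],        # mismatches within the shared length
  s.a[seq_along(s.a) > common],      # leftover characters of a, if any
  s.b[seq_along(s.b) > common])      # leftover characters of b, if any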
# [1] "w" "h"

Agreed that this is pretty crude. I appreciate the answer, but is there really not a more elegant solution out there? — Adam_G, May 12 '23 at 20:14
This is really impressive. Thank you for all of this! How would I wrap it in a function so I could use it with `mutate()`? — Adam_G, May 15 '23 at 19:28
Are you looking for this 1-to-1 comparison to return a string `"el"` or a list-column instead? Regardless, you'll need to remove the `[[1]]` from each `strsplit`, then start doing some vectorized stuff (since this code was meant to answer your question, comparing 1-to-1). — r2evans, May 15 '23 at 19:33
Ok, that makes sense. I'm looking for a returned string. I'm using `rowwise()` just so I don't get in any trouble moving from row to row. — Adam_G, May 15 '23 at 19:55
I'm assuming you don't have a lot of data ... with _many_ rows, `rowwise()` is slower and can likely be avoided. — r2evans, May 15 '23 at 20:07
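For the function asked about in the comments, one possible sketch in base R follows. The name char_diff() and its inner helper one_pair() are hypothetical and not from the answer above. It applies the same position-wise logic to each pair of strings and pastes the result into a single string per pair.

```r
# Hypothetical helper: position-wise difference for each pair of strings,
# returned as one string per pair (e.g. "el" rather than c("e", "l")).
char_diff <- function(a, b) {
  one_pair <- function(x, y) {
    s.x <- strsplit(x, "")[[1]]
    s.y <- strsplit(y, "")[[1]]
    common <- min(length(s.x), length(s.y))
    i <- seq_len(common)
    out <- c(s.y[i][s.y[i] != s.x[i]],      # mismatches in the shared part
             s.x[seq_along(s.x) > common],  # leftover characters of x
             s.y[seq_along(s.y) > common])  # leftover characters of y
    paste(out, collapse = "")
  }
  # mapply() walks both vectors in parallel, one pair at a time
  mapply(one_pair, a, b, USE.NAMES = FALSE)
}

char_diff(c("what the h", "what the h"), c("what the hel", "what the hwh"))
# [1] "el" "wh"
```

Because it takes whole vectors, it should work inside `mutate()` without `rowwise()`, for example `mutate(df, added = char_diff(old, new))`, where `df`, `old`, and `new` are assumed names. It has not been written with NA inputs in mind.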
