16

I have used adist to calculate the number of characters that differ between two strings:

a <- "Happy day"
b <- "Tappy Pay"
adist(a,b) # result 2

Now I would like to extract the characters that differ. In my example, I would like to get the string "Hd" (or "TP"; either is fine).

I looked at adist, agrep and stringi but found nothing suitable.

Dario Lacan
  • 2
    I suggest you undo the edit and ask a new question. In this new question you'll have to give much more information about your real data. For example, it matters hugely whether you know that the different string is at the start vs. at the end of the string. You also have to tell us if your problem relates at all to the [longest common substring problem](http://en.wikipedia.org/wiki/Longest_common_substring_problem). – Andrie Mar 03 '15 at 20:27
  • 1
    Agreed, undo the edit, accept the best answer, and ask a new question. The question is substantively different, and a lot of people have put in a lot of work already. – BrodieG Mar 03 '15 at 20:28

6 Answers

30

You can use the following sequence of operations:

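  • Split the strings into single characters with strsplit().
  • Compare the resulting character vectors with setdiff().
  • Wrap the comparison in Reduce() so it is applied across the list that strsplit() returns.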

Try this:

Reduce(setdiff, strsplit(c(a, b), split = ""))
[1] "H" "d"
Andrie
  • `do.call(setdiff, strsplit(c(a, b), split = ""))` will probably be more efficient – David Arenburg Mar 03 '15 at 14:46
  • Second arg to `strsplit` is `split` so you don't need to name it if you want to get down in fewer shots. – Spacedman Mar 03 '15 at 14:47
  • 1
    @DavidArenburg Not if you're playing code golf. `Reduce` is one less keystroke than `do.call` :-) – Andrie Mar 03 '15 at 14:47
  • 2
    @Spacedman As the number of answers to a programming question grows, the probability of it degenerating into code golf approaches 1. – James Mar 03 '15 at 15:05
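To make the `Reduce` vs. `do.call` discussion concrete, here is a minimal sketch using the strings from the question. The three-string input at the end is made up for illustration; it only shows that `Reduce` keeps folding `setdiff` over a longer list, which `do.call` would not do.

```r
a <- "Happy day"
b <- "Tappy Pay"

chars <- strsplit(c(a, b), "")

# With exactly two strings, both forms end up calling
# setdiff(chars[[1]], chars[[2]]).
via_reduce  <- Reduce(setdiff, chars)
via_do_call <- do.call(setdiff, chars)
identical(via_reduce, via_do_call)  # TRUE

# Collapse to a single string, as the question asks.
paste(via_reduce, collapse = "")    # "Hd"

# Illustrative input: characters of the first string that
# appear in none of the following strings.
Reduce(setdiff, strsplit(c("abcd", "b", "c"), ""))  # "a" "d"
```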
7

Split into letters and take the difference as sets:

> setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]])
[1] "H" "d"
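One caveat worth spelling out: `setdiff()` works on sets, so it ignores both position and repetition. The sketch below uses a hypothetical helper, `set_diff_chars`, and made-up inputs to show two cases where that may not be what you want.

```r
# Hypothetical helper wrapping the one-liner above.
set_diff_chars <- function(x, y) {
  setdiff(strsplit(x, "")[[1]], strsplit(y, "")[[1]])
}

# A repeated character is not reported, because "a" occurs in both strings.
set_diff_chars("aab", "ab")   # character(0)

# Every position differs, yet both strings use the same set of characters.
set_diff_chars("abc", "bca")  # character(0)

# The result is also asymmetric: swapping the arguments gives the other side.
set_diff_chars("Tappy Pay", "Happy day")  # "T" "P"
```

If the differences have to be reported position by position, a positional comparison is the safer choice.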
Spacedman
5

Not really proud of this, but it seems to do the job:

sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8)

Results:

[1] "H" "d"
JasonAizkalns
  • 4
    That's a nice one. You could probably vectorize it by `intToUtf8(setdiff(utf8ToInt(a), utf8ToInt(b)))` – David Arenburg Mar 03 '15 at 14:44
  • You might not be proud, but this helped me find a "non-breaking space" instead of a regular space by comparing Unicode integers. – James Mar 03 '23 at 16:14
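Following up on the two comments, here is a small sketch of the vectorized form and of the non-breaking-space case. The `plain` and `nbsp` strings are made up for illustration.

```r
a <- "Happy day"
b <- "Tappy Pay"

# intToUtf8() is vectorized, so sapply() is not needed.
as_string <- intToUtf8(setdiff(utf8ToInt(a), utf8ToInt(b)))                   # "Hd"
as_chars  <- intToUtf8(setdiff(utf8ToInt(a), utf8ToInt(b)), multiple = TRUE)  # "H" "d"

# Two strings that look the same when printed: one contains a
# non-breaking space (U+00A0) instead of a regular space.
plain <- "Happy day"
nbsp  <- "Happy\u00a0day"

# The code points expose the difference: 160 is U+00A0.
odd_code_points <- setdiff(utf8ToInt(nbsp), utf8ToInt(plain))
odd_code_points
```

Like the other `setdiff()` answers, this reports which code points occur in only one string, not where they occur.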
4

As long as a and b have the same length we can do this:

s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")

giving:

[1] "Hd"

The code is straightforward to read, and in the benchmark below it appears to tie for fastest among the solutions posted here, although I think I prefer f3:

f1 <- function(a, b)
  paste(setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]]), collapse = "")

f2 <- function(a, b)
  paste(sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8), collapse = "")

f3 <- function(a, b) 
  paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")

f4 <- function(a, b) {
  s.a <- strsplit(a, "")[[1]]
  s.b <- strsplit(b, "")[[1]]
  paste(s.a[s.a != s.b], collapse = "")
}

a <- "Happy day"
b <- "Tappy Pay"

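library(rbenchmark)
# Pass calls rather than bare function names so that each function is actually run.
benchmark(f1 = f1(a, b), f2 = f2(a, b), f3 = f3(a, b), f4 = f4(a, b),
          replications = 10000, order = "relative")[1:4]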

giving the following on a fresh session on my laptop:

  test replications elapsed relative
3   f3        10000    0.07    1.000
4   f4        10000    0.07    1.000
1   f1        10000    0.09    1.286
2   f2        10000    0.10    1.429

I have assumed that the differences must be in the corresponding character positions. You might want to clarify if that is the intention or not.
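Since the question started from `adist`, it may be worth noting that `adist(..., counts = TRUE)` also returns a "trafos" attribute that encodes the alignment as a string of M (match), I (insertion), D (deletion) and S (substitution). The sketch below uses it to pick out the substituted characters. It assumes the alignment contains no insertions or deletions, which holds for the example strings; handling I and D would need extra index bookkeeping that is not shown here.

```r
a <- "Happy day"
b <- "Tappy Pay"

d <- adist(a, b, counts = TRUE)

# One transformation string per pair of inputs; here there is only one pair.
trafo <- attr(d, "trafos")[1]
ops <- strsplit(trafo, "")[[1]]

# Assumption: substitutions only, so operation i lines up with
# character i of both strings.
stopifnot(!any(ops %in% c("I", "D")))

a_chars <- strsplit(a, "")[[1]]
b_chars <- strsplit(b, "")[[1]]

from_a <- paste(a_chars[ops == "S"], collapse = "")  # "Hd"
from_b <- paste(b_chars[ops == "S"], collapse = "")  # "TP"
```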

G. Grothendieck
3

You can use one of the variables as a regex character class and gsub out from the other one:

gsub(paste0("[",a,"]"),"",b)
[1] "TP"
gsub(paste0("[",b,"]"),"",a)
[1] "Hd"
James
  • Does that work if the strings have regexpy-special chars in them? – Spacedman Mar 03 '15 at 15:30
  • @Spacedman Yes, good catch. Characters that are special inside a regex character class, such as `^` and `-`, may cause issues. This could be a particular problem with hyphenated words. – James Mar 03 '15 at 15:53
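To illustrate the concern raised in the comments, here is a sketch that sidesteps the regex entirely by filtering with `%in%`. The helper name `drop_chars` and the hyphenated input are assumptions for illustration, not part of the answer above.

```r
# Hypothetical helper: remove from x every character that occurs in chars,
# without building a regular expression.
drop_chars <- function(x, chars) {
  x_chars <- strsplit(x, "")[[1]]
  paste(x_chars[!x_chars %in% strsplit(chars, "")[[1]]], collapse = "")
}

drop_chars("Tappy Pay", "Happy day")  # "TP"

# With a hyphen, the character class "[a-c]" is read as the range a to c,
# so "b" is removed and "-" is kept.
gsub(paste0("[", "a-c", "]"), "", "a-c b")  # "- "

# The %in% version treats "-" as an ordinary character.
drop_chars("a-c b", "a-c")                  # " b"
```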
1

The following function may be a better option for problems like this:

list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE)
{
  # Positions are compared pairwise, so both strings must have the same length.
  if (nchar(a) != nchar(b)) stop("Lengths of input strings differ. Please check your input.")
  if (ignore.case)
  {
    a <- toupper(a)
    b <- toupper(b)
  }
  split_seqs <- strsplit(c(a, b), split = "")
  only.diff <- (split_seqs[[1]] != split_seqs[[2]])
  # Mark positions that hold an excluded character as NA.
  only.diff[
    (split_seqs[[1]] %in% exclude) |
    (split_seqs[[2]] %in% exclude)
  ] <- NA
  diff.info <- data.frame(which(is.na(only.diff) | only.diff),
                          split_seqs[[1]][only.diff], split_seqs[[2]][only.diff])
  names(diff.info) <- c("position", "poly.seq.a", "poly.seq.b")
  if (!show.excluded) diff.info <- na.omit(diff.info)
  diff.info
}

Source: https://www.r-bloggers.com/extract-different-characters-between-two-strings-of-equal-length/

Then you can run

list.string.diff(a, b)

to get the difference.
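For readers who only need the core idea, here is a stripped-down sketch of the same approach: a data frame with one row per differing position. The name `diff_positions` is made up for illustration, and unlike the function above this version is case-sensitive and has no exclusion list.

```r
# Hypothetical, simplified variant: one row per position where the strings differ.
diff_positions <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))  # positional comparison needs equal lengths
  s <- strsplit(c(a, b), "")
  differs <- s[[1]] != s[[2]]
  data.frame(position = which(differs),
             a = s[[1]][differs],
             b = s[[2]][differs],
             stringsAsFactors = FALSE)
}

res <- diff_positions("Happy day", "Tappy Pay")
res
# Positions 1 and 7 differ: "H" vs "T" and "d" vs "P".
```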

Shixiang Wang