Extract characters that differ between two strings

Question

I have used adist to calculate the number of characters that differ between two strings:

a <- "Happy day"
b <- "Tappy Pay"
adist(a,b) # result 2

Now I would like to extract the characters that differ. In my example, I would like to get the string "Hd" (or "TP", it doesn't matter).

I tried to look in adist, agrep and stringi but found nothing.

I suggest you undo the edit and ask a new question. In this new question you'll have to give much more information about your real data. For example, it matters hugely whether you know that the different string is at the start vs. at the end of the string. You also have to tell us if your problem relates at all to the [longest common substring problem](http://en.wikipedia.org/wiki/Longest_common_substring_problem). — Andrie, Mar 03 '15 at 20:27
Agreed, undo the edit, accept the best answer, and ask a new question. The question is substantively different, and a lot of people have put in a lot of work already. — BrodieG, Mar 03 '15 at 20:28

score 30 · Accepted Answer · answered Mar 03 '15 at 14:44

You can use the following sequence of operations:

Split the strings into single characters using strsplit().
Use setdiff() to compare the resulting character vectors.
Wrap the call in a reducing function such as Reduce().

Try this:

Reduce(setdiff, strsplit(c(a, b), split = ""))
[1] "H" "d"
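One caveat worth illustrating: setdiff() works on sets, so repeated characters are not counted. The short sketch below uses made-up inputs (not from the question) to show a pair of strings that clearly differ but produce an empty result.

```r
# Illustrative inputs only: "aab" has one more "a" than "ab"
chars <- strsplit(c("aab", "ab"), split = "")

# setdiff() treats each vector as a set, so the extra "a" goes unreported
result <- Reduce(setdiff, chars)
result
# character(0)
```

If repeated characters matter for your data, a position-wise comparison may be a better fit than a set difference.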

Andrie

`do.call(setdiff, strsplit(c(a, b), split = ""))` will probably be more efficient – David Arenburg Mar 03 '15 at 14:46
Second arg to `strsplit` is `split` so you don't need to name it if you want to get down in fewer shots. – Spacedman Mar 03 '15 at 14:47
@DavidArenburg Not if you're playing code golf. `Reduce` is one less keystroke than `do.call` :-) – Andrie Mar 03 '15 at 14:47
@Spacedman As the number of answers to a programming question grows, the probability of it degenerating into code golf approaches 1. – James Mar 03 '15 at 15:05

score 7 · Answer 2 · answered Mar 03 '15 at 14:43

Split into letters and take the difference as sets:

> setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]])
[1] "H" "d"
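The question accepts either "Hd" or "TP", and setdiff() is not symmetric, so the argument order decides which one you get. As a minimal sketch, the hypothetical helper `char_setdiff_both()` below (a name invented for this example) returns both directions at once.

```r
# char_setdiff_both() is a hypothetical helper, not part of base R
char_setdiff_both <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # setdiff(x, y) keeps elements of x that are absent from y, and vice versa
  list(only_in_a = setdiff(x, y), only_in_b = setdiff(y, x))
}

both <- char_setdiff_both("Happy day", "Tappy Pay")
both$only_in_a
# [1] "H" "d"
both$only_in_b
# [1] "T" "P"
```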

Spacedman

score 5 · Answer 3 · answered Mar 03 '15 at 14:43

Not really proud of this, but it seems to do the job:

sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8)

Results:

[1] "H" "d"

JasonAizkalns

That's a nice one. You could probably vectorize it by `intToUtf8(setdiff(utf8ToInt(a), utf8ToInt(b)))` – David Arenburg Mar 03 '15 at 14:44
You might not be proud, but this helped me find a "non breaking space" instead of a regular space by comparing unicode integers. – James Mar 03 '23 at 16:14
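The non-breaking space case from the comment above can be sketched as follows. The strings `x` and `y` are made-up inputs for illustration. The last lines show `intToUtf8()` with `multiple = TRUE`, which returns one string per code point and so avoids the `sapply()` wrapper.

```r
x <- "total\u00a0cost"   # contains U+00A0 (non-breaking space)
y <- "total cost"        # contains U+0020 (regular space)

# Code points present in x but not in y
only_in_x <- setdiff(utf8ToInt(x), utf8ToInt(y))
only_in_x
# [1] 160

# Format the code point in the usual U+XXXX notation
code_label <- sprintf("U+%04X", only_in_x)
code_label
# [1] "U+00A0"

# multiple = TRUE gives one element per code point instead of a single string
diff_chars <- intToUtf8(setdiff(utf8ToInt("Happy day"), utf8ToInt("Tappy Pay")),
                        multiple = TRUE)
diff_chars
# [1] "H" "d"
```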

G. Grothendieck · Answer 4 · 2015-03-03T20:17:06.830

As long as a and b have the same length we can do this:

s.a <- strsplit(a, "")[[1]]
s.b <- strsplit(b, "")[[1]]
paste(s.a[s.a != s.b], collapse = "")

giving:

[1] "Hd"

The code is straightforward, and it appears to be tied for fastest among the solutions provided here, although I think I prefer f3:

f1 <- function(a, b)
  paste(setdiff(strsplit(a,"")[[1]],strsplit(b,"")[[1]]), collapse = "")

f2 <- function(a, b)
  paste(sapply(setdiff(utf8ToInt(a), utf8ToInt(b)), intToUtf8), collapse = "")

f3 <- function(a, b) 
  paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")

f4 <- function(a, b) {
  s.a <- strsplit(a, "")[[1]]
  s.b <- strsplit(b, "")[[1]]
  paste(s.a[s.a != s.b], collapse = "")
}

a <- "Happy day"
b <- "Tappy Pay"

library(rbenchmark)
benchmark(f1, f2, f3, f4, replications = 10000, order = "relative")[1:4]

giving the following on a fresh session on my laptop:

  test replications elapsed relative
3   f3        10000    0.07    1.000
4   f4        10000    0.07    1.000
1   f1        10000    0.09    1.286
2   f2        10000    0.10    1.429
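Note that `benchmark(f1, f2, f3, f4, ...)` passes bare function names, so each replication appears to evaluate only the symbol rather than call the function (the `test` column showing `f3` instead of `f3(a, b)` is consistent with that). With rbenchmark, the calls would be written as `benchmark(f1(a, b), f2(a, b), ...)`. As a rough base-R sketch of timing the actual calls, the hypothetical helper `time_calls()` below (invented for this example) runs a zero-argument function repeatedly; only f3 and f4 are repeated to keep it short, and timings will differ by machine.

```r
f3 <- function(a, b)
  paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")

f4 <- function(a, b) {
  s.a <- strsplit(a, "")[[1]]
  s.b <- strsplit(b, "")[[1]]
  paste(s.a[s.a != s.b], collapse = "")
}

a <- "Happy day"
b <- "Tappy Pay"

# time_calls() is a hypothetical helper: it calls fun() n times
# and returns the elapsed wall-clock time in seconds
time_calls <- function(fun, n = 10000) {
  system.time(for (i in seq_len(n)) fun())[["elapsed"]]
}

timings <- c(
  f3 = time_calls(function() f3(a, b)),
  f4 = time_calls(function() f4(a, b))
)
print(timings)
```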

I have assumed that the differences must be in the corresponding character positions. You might want to clarify if that is the intention or not.
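To make that assumption concrete, here is a minimal sketch of a position-wise comparison with an explicit length check. `diff_by_position()` is a hypothetical name introduced for this example. The second call uses made-up inputs to show how the position-wise result differs from a set difference when the same letters appear in a different order.

```r
# diff_by_position() is a hypothetical helper, not from any package
diff_by_position <- function(a, b) {
  s.a <- strsplit(a, "")[[1]]
  s.b <- strsplit(b, "")[[1]]
  # Without this check, != would silently recycle the shorter vector
  stopifnot(length(s.a) == length(s.b))
  paste(s.a[s.a != s.b], collapse = "")
}

diff_by_position("Happy day", "Tappy Pay")
# [1] "Hd"

# Same letters, different order: every position differs...
diff_by_position("abc", "bca")
# [1] "abc"

# ...while the set difference is empty
setdiff(strsplit("abc", "")[[1]], strsplit("bca", "")[[1]])
# character(0)
```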

score 3 · Answer 5 · answered Mar 03 '15 at 14:58

You can turn one of the strings into a regex character class and use gsub() to strip those characters from the other string:

gsub(paste0("[",a,"]"),"",b)
[1] "TP"
gsub(paste0("[",b,"]"),"",a)
[1] "Hd"

James

Does that work if the strings have regexpy-special chars in them? – Spacedman Mar 03 '15 at 15:30
@Spacedman Yes, good catch. Characters that are special inside a regex character class, such as `^` and `-`, may cause issues. This could be a particular problem with hyphenated words. – James Mar 03 '15 at 15:53
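One way to sidestep the special-character concern is to avoid regex altogether. The sketch below swaps in a different technique (`strsplit()` plus `%in%`) that, like the gsub() approach, drops every character of one string that occurs anywhere in the other. `chars_not_in()` is a hypothetical name introduced for this example.

```r
# chars_not_in() is a hypothetical helper: it keeps the characters of x
# that do not occur anywhere in y, preserving order and repeats
chars_not_in <- function(x, y) {
  sx <- strsplit(x, "")[[1]]
  sy <- strsplit(y, "")[[1]]
  paste(sx[!sx %in% sy], collapse = "")
}

chars_not_in("Tappy Pay", "Happy day")
# [1] "TP"
chars_not_in("Happy day", "Tappy Pay")
# [1] "Hd"

# "^" and "-" are ordinary characters here, since no regex is involved
chars_not_in("a-b^c", "^-")
# [1] "abc"
```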

score 1 · Answer 6 · answered Jun 22 '18 at 08:23

The following function could be a better option for problems like this.

list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE)
{
  # A position-wise comparison requires strings of equal length
  if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
  if(ignore.case)
  {
    a <- toupper(a)
    b <- toupper(b)
  }
  split_seqs <- strsplit(c(a, b), split = "")
  # TRUE where the characters at the same position differ
  only.diff <- (split_seqs[[1]] != split_seqs[[2]])
  # Mark positions holding an excluded character in either string as NA
  only.diff[
    (split_seqs[[1]] %in% exclude) |
    (split_seqs[[2]] %in% exclude)
  ] <- NA
  # Indexing with NA yields NA entries, which na.omit() drops below
  diff.info<-data.frame(which(is.na(only.diff)|only.diff),
    split_seqs[[1]][only.diff],split_seqs[[2]][only.diff])
  names(diff.info)<-c("position","poly.seq.a","poly.seq.b")
  if(!show.excluded) diff.info<-na.omit(diff.info)
  diff.info
}

from https://www.r-bloggers.com/extract-different-characters-between-two-strings-of-equal-length/

Then you can run

list.string.diff(a, b)

to get the difference.
