Extract distinct characters that differ between two strings

Question

I have two strings, a <- "AERRRTX"; b <- "TRRA".

I want to extract the characters in a that are not used up by b, i.e. "ERX".

I tried the answer in Extract characters that differ between two strings, which uses setdiff. It returns "EX", because b contains "R" and setdiff removes all three "R"s from a. My aim is to treat each occurrence of a character as distinct, so only two of the three "R"s in a should be removed.
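For reference, a minimal reproduction of the setdiff behaviour described above. It uses base R only; a_chars, b_chars and res_setdiff are illustrative names.

```r
a <- "AERRRTX"; b <- "TRRA"
a_chars <- strsplit(a, "")[[1]]
b_chars <- strsplit(b, "")[[1]]

# setdiff() works on sets: any character of a that occurs in b at all is removed,
# no matter how many times it occurs, so all three "R"s disappear
res_setdiff <- setdiff(a_chars, b_chars)
paste(res_setdiff, collapse = "")
# [1] "EX"
```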

Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?

Ronak Shah · Accepted Answer · 2016-03-23T09:04:25.857

11

A different approach using pmatch:

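# a and b as defined in the question
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))

# pmatch() uses each position of a1 at most once, so a repeated character
# in b1 removes only as many copies from a1
a1[!seq_along(a1) %in% pmatch(b1, a1)]

# [1] "E" "R" "X"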

Another example:

a <- "Ronak"; b <- "Shah"

a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!seq_along(a1) %in% pmatch(b1, a1)]

# [1] "R" "o" "n" "k"

edited Mar 23 '16 at 09:04

answered Mar 23 '16 at 08:28

Ronak Shah


1

Minor point: It's advisable to avoid assigning `c`, since it's a commonly used built-in function. If `c` is a variable defined in any enclosing environment, references to that identifier can bind to it, which can mess up a lot of code. For example, `do.call(c,...)` fails in this case. – bgoldst Mar 23 '16 at 08:42
2

Nice alternative. You could replace your third line with `a1[-pmatch(b1, a1)]`. Also, it would be useful to note the "duplicates.ok = FALSE" argument of `pmatch` which differentiates its behaviour to `match` – alexis_laz Mar 23 '16 at 09:19
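A small sketch of the difference between match and pmatch mentioned in the comment above, plus one caveat for the negative-index shortcut: pmatch returns NA for characters of b that do not occur in a, and R does not accept NA mixed with negative subscripts. The names m_idx, p_idx, idx and res2 are only for illustration.

```r
a1 <- strsplit("AERRRTX", "")[[1]]
b1 <- strsplit("TRRA", "")[[1]]

# match() always returns the first matching position, so both "R"s map to 3
m_idx <- match(b1, a1)
# pmatch() with the default duplicates.ok = FALSE uses each position at most once
p_idx <- pmatch(b1, a1)

# negative indexing works here because every character of b1 was found in a1
a1[-p_idx]
# [1] "E" "R" "X"

# when some characters of b are absent from a, drop the NAs first
a2 <- strsplit("Ronak", "")[[1]]
b2 <- strsplit("Shah", "")[[1]]
idx <- pmatch(b2, a2)
idx <- idx[!is.na(idx)]
res2 <- if (length(idx)) a2[-idx] else a2
res2
# [1] "R" "o" "n" "k"
```

The `!seq_along(a1) %in% pmatch(b1, a1)` form in the answer does not need this guard, since %in% simply never matches the NA entries.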

bgoldst · Answer 2 · 2016-03-23T08:39:41.307

We can use Reduce() to successively eliminate from a each character found in b:

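a <- 'AERRRTX'; b <- 'TRRA'
# for each character of b, drop its first remaining match from a;
# nomatch = length(as) + 1L makes the negative index a no-op when nothing matches
dropFirst <- function(as, bc) as[-match(bc, as, nomatch = length(as) + 1L)]
paste(collapse = '', Reduce(dropFirst, strsplit(b, '')[[1L]], strsplit(a, '')[[1L]]))
## [1] "ERX"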

This will preserve the order of the surviving characters in a.
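To make the successive elimination visible, here is a sketch that runs the same idea with accumulate = TRUE so every intermediate state is kept. The helper drop_first is a hypothetical rewrite of the anonymous function above (it checks for NA instead of using nomatch), not part of the original answer.

```r
a <- "AERRRTX"; b <- "TRRA"

# drop the first remaining occurrence of ch from chars, if there is one
drop_first <- function(chars, ch) {
  pos <- match(ch, chars)
  if (is.na(pos)) chars else chars[-pos]
}

# init is the split a; each character of b removes at most one character
steps <- Reduce(drop_first, strsplit(b, "")[[1]], strsplit(a, "")[[1]],
                accumulate = TRUE)
step_strings <- vapply(steps, paste, character(1), collapse = "")
step_strings
# [1] "AERRRTX" "AERRRX"  "AERRX"   "AERX"    "ERX"
```

Each step removes the leftmost remaining copy, so the characters that survive keep their relative order from a.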

Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():

a <- 'AERRRTX'; b <- 'TRRA'
# tag each character with its running occurrence count, e.g. "R1", "R2", "R3"
pasteOccurrence <- function(x) ave(x, x, FUN = function(x) paste0(x, seq_along(x)))
paste(collapse = '', substr(setdiff(pasteOccurrence(strsplit(a, '')[[1L]]), pasteOccurrence(strsplit(b, '')[[1L]])), 1L, 1L))
## [1] "ERX"
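The intermediate values of the occurrence-index approach may make it easier to follow. This sketch reuses pasteOccurrence as defined above; a_tagged, b_tagged and left are illustrative names.

```r
pasteOccurrence <- function(x) ave(x, x, FUN = function(x) paste0(x, seq_along(x)))

a_tagged <- pasteOccurrence(strsplit("AERRRTX", "")[[1]])
b_tagged <- pasteOccurrence(strsplit("TRRA", "")[[1]])
a_tagged
# [1] "A1" "E1" "R1" "R2" "R3" "T1" "X1"
b_tagged
# [1] "T1" "R1" "R2" "A1"

# the tags are unique within each vector, so plain setdiff() is now safe
left <- setdiff(a_tagged, b_tagged)
left
# [1] "E1" "R3" "X1"
```

Note that the surviving "R" is the third one: with this tagging, the earliest occurrences in a are the ones cancelled by b.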

score 4 · Answer 3 · answered Mar 23 '16 at 07:59

4

You can use the function vsetdiff from the vecsets package:

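install.packages("vecsets")  # only needed once
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"
# split both strings into characters, then take the difference keeping duplicates
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"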

answered Mar 23 '16 at 07:59

Robert Plant


Colonel Beauvel · Answer 4 · 2016-03-23T08:54:37.560

3

An alternative using the data.table package:

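library(data.table)

# per-character counts for a and b (as defined in the question); columns V1 and N
x <- data.table(table(strsplit(a, '')[[1]]))
y <- data.table(table(strsplit(b, '')[[1]]))

# join b's counts (N) onto a's counts (i.N), treat characters missing from b as 0,
# and keep only the characters with a surplus in a
dt <- y[x, on = 'V1'][, N := ifelse(is.na(N), 0L, N)][N != i.N, res := i.N - N][res > 0]

rep(dt$V1, dt$res)
# [1] "E" "R" "X"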
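The same counting idea can be sketched in base R without a join. This is only an illustration of the approach, not a translation of the data.table code; count_diff is a hypothetical helper name.

```r
count_diff <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  lev <- unique(a_chars)  # characters of a, in order of first appearance
  # count each character of a in both strings; characters only in b are ignored
  n_a <- as.integer(table(factor(a_chars, levels = lev)))
  n_b <- as.integer(table(factor(b_chars, levels = lev)))
  # repeat each character by its surplus in a, never a negative number of times
  rep(lev, times = pmax(n_a - n_b, 0L))
}

count_diff("AERRRTX", "TRRA")
# [1] "E" "R" "X"
count_diff("Ronak", "Shah")
# [1] "R" "o" "n" "k"
```

Like any count-based version, this groups repeated characters together in the output rather than keeping their original positions in a, so it is a reasonable choice only when the order of the result does not matter.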

edited Mar 23 '16 at 08:54

answered Mar 23 '16 at 08:13

Colonel Beauvel

