8

I have two strings: a <- "AERRRTX"; b <- "TRRA".

I want to extract the characters of a that are left over after removing one occurrence for each character of b, i.e. "ERX".

I tried the answer in "Extract characters that differ between two strings", which uses setdiff. It returns "EX", because b contains "R" and setdiff eliminates all three "R"s in a. My aim is to treat each character occurrence as distinct, so only two of the three "R"s in a should be eliminated.
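To make the problem concrete, here is a minimal sketch of the setdiff behaviour described above. The names a_chars, b_chars and set_result are only illustrative.

```r
a <- "AERRRTX"; b <- "TRRA"
a_chars <- strsplit(a, "")[[1]]
b_chars <- strsplit(b, "")[[1]]

# setdiff() treats its inputs as sets: any character of a that occurs
# anywhere in b is dropped entirely, so all three "R"s disappear
set_result <- setdiff(a_chars, b_chars)
set_result
# [1] "E" "X"
```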

Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?

Ricky

4 Answers

11

A different approach, using pmatch:

a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, "")) 

# keep the positions of a1 that were not claimed by a character of b1
a1[!(seq_along(a1) %in% pmatch(b1, a1))]

 #[1] "E" "R" "X"
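As a rough illustration of why pmatch works here where match would not: with its default duplicates.ok = FALSE, pmatch uses each element of the table at most once, whereas match always returns the first matching position. The names idx_match and idx_pmatch below are only illustrative.

```r
a1 <- strsplit("AERRRTX", "")[[1]]
b1 <- strsplit("TRRA", "")[[1]]

# match() maps both "R"s of b1 to the same (first) "R" in a1
idx_match <- match(b1, a1)
idx_match
# [1] 6 3 3 1

# pmatch() excludes an element of a1 once it has been matched,
# so the second "R" of b1 is paired with the next "R" in a1
idx_pmatch <- pmatch(b1, a1)
idx_pmatch
# [1] 6 3 4 1
```

Characters of b1 that do not occur in a1 give NA, which the %in% test simply ignores.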

Another example:

a <- "Ronak"; b <- "Shah"

a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!(seq_along(a1) %in% pmatch(b1, a1))]

# [1] "R" "o" "n" "k"
Ronak Shah
  • 1
    Minor point: It's advisable to avoid assigning `c`, since it's a commonly used built-in function. If `c` is a variable defined in any enclosing environment, references to that identifier can bind to it, which can mess up a lot of code. For example, `do.call(c,...)` fails in this case. – bgoldst Mar 23 '16 at 08:42
  • 2
    Nice alternative. You could replace your third line with `a1[-pmatch(b1, a1)]`. Also, it would be useful to note the "duplicates.ok = FALSE" argument of `pmatch` which differentiates its behaviour to `match` – alexis_laz Mar 23 '16 at 09:19
4

We can use Reduce() to successively eliminate from a each character found in b:

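a <- 'AERRRTX'; b <- 'TRRA'
# for each character of b, drop its first match in a; the out-of-range nomatch index drops nothing
remaining <- Reduce(function(as, bc) as[-match(bc, as, nomatch = length(as) + 1L)],
                    strsplit(b, '')[[1L]], strsplit(a, '')[[1L]])
paste(remaining, collapse = '')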
## [1] "ERX"

This will preserve the order of the surviving characters in a.


Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():

a <- 'AERRRTX'; b <- 'TRRA'
# tag each character with its occurrence number, e.g. "R1", "R2", "R3"
pasteOccurrence <- function(x) ave(x, x, FUN = function(x) paste0(x, seq_along(x)))
leftover <- setdiff(pasteOccurrence(strsplit(a, '')[[1L]]), pasteOccurrence(strsplit(b, '')[[1L]]))
paste(substr(leftover, 1L, 1L), collapse = '')
## [1] "ERX"
bgoldst
4

You can use the function vsetdiff from the vecsets package:

install.packages("vecsets")
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"  
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"
3

An alternative using the data.table package:

library(data.table)

x = data.table(table(strsplit(a, '')[[1]]))
y = data.table(table(strsplit(b, '')[[1]]))

dt = y[x, on='V1'][,N:=ifelse(is.na(N),0,N)][N!=i.N,res:=i.N-N][res>0]

rep(dt$V1, dt$res)
#[1] "E" "R" "X"
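The same counting idea can also be sketched in base R, assuming it is acceptable for the result to be grouped by character (in order of first appearance in a) rather than to keep the original positions. This is only an illustration, not part of the data.table answer; the names lv, n_a, n_b, left and res are made up for the example.

```r
a <- "AERRRTX"; b <- "TRRA"
a_chars <- strsplit(a, "")[[1]]
b_chars <- strsplit(b, "")[[1]]

# distinct characters of a, in order of first appearance
lv <- unique(a_chars)

# occurrences of each of those characters in a and in b;
# characters that occur only in b give NA in match() and are ignored by tabulate()
n_a <- tabulate(match(a_chars, lv), nbins = length(lv))
n_b <- tabulate(match(b_chars, lv), nbins = length(lv))

# copies left over after removing one copy per occurrence in b
left <- pmax(n_a - n_b, 0L)
res <- rep(lv, times = left)
res
# [1] "E" "R" "X"
```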
Colonel Beauvel