
I want to remove elements from a vector as many times as they occur in another vector, as if I were subtracting one vector from the other. Every element of the vector I want to remove is guaranteed to exist in the main vector I am removing from.

a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

a[!a %in% b] # returns character(0)

# expected result: "B" "C"

I don't want to use a library for this. I'd rather write a function, if possible without loops. Is there a way to do so? Thank you in advance.

shnike
  • Very possibly a duplicate of ["Set Difference" between two vectors with duplicate values](https://stackoverflow.com/questions/52941312/set-difference-between-two-vectors-with-duplicate-values) – thelatemail Mar 23 '23 at 00:19

5 Answers


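This may not be the most efficient approach, but it works:

Reduce(function(prev, this) {
  # Index of the first remaining occurrence of `this` (NA if there is none)
  ind <- match(this, prev)
  # Drop that single occurrence; leave `prev` unchanged when nothing matches
  if (is.na(ind)) prev else prev[-ind]
}, b, init = a)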
# [1] "B" "C"
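
To see what the Reduce call does step by step, here is a small sketch that keeps the intermediate results via accumulate = TRUE. The helper name drop_first is made up for this illustration, and a and b are redefined so the block runs on its own.

```r
a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

# Hypothetical helper: remove the first occurrence of `this` from `prev`
drop_first <- function(prev, this) {
  ind <- match(this, prev)              # NA when `this` is not in `prev`
  if (is.na(ind)) prev else prev[-ind]
}

# accumulate = TRUE returns every intermediate state as a list:
# the initial vector, then the vector after removing each element of b
steps <- Reduce(drop_first, b, accumulate = TRUE, init = a)
for (s in steps) print(s)
```

The last element of steps is the final result; each earlier element is one item of b shorter than the one before it.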

For fun, here's a non-Reduce variant (motivated by looking at AllanCameron's simpler answer) that preserves order. The added complexity is only worth it if preserving order is necessary.

finddiff2 <- function(A, B) {
  dict <- split(seq_along(A), A)
  tb <- table(B)
  nms <- intersect(names(tb), A)
  dict[nms] <- Map(tail, dict[nms], -tb[nms])
  A[sort(unlist(dict))]
}
finddiff2(a, b)
# [1] "B" "C"
finddiff2(rev(a), b)
# [1] "C" "B"
finddiff2(c("A","B"), "A")
# [1] "B"

The preservation is easier to see with a longer a:

a <- rep(c("A","B","C"), times = 4)
finddiff2(a, b)
# [1] "A" "B" "A" "B" "C" "A" "B" "C"
finddiff2(rev(a), b)
# [1] "B" "A" "C" "B" "A" "C" "B" "A"
r2evans

In base R you could use pmatch:

a[-pmatch(b, a, 0)]
[1] "B" "C"

Note that the third argument, 0 (nomatch), is needed in case b contains a value that does not exist in a; by default pmatch returns NA for such a value.

If all the elements in b are in a, then the following is sufficient:

a[-pmatch(b, a)]
[1] "B" "C"

NB: as @jblood94 pointed out, this only gives the correct result when at least one of the two vectors has 100 elements or fewer, i.e. when length(a) < 101 | length(b) < 101.

Onyambu
  • This should come with a strong caveat of `length(a) < 101 | length(b) < 101`. Otherwise the result will be incorrect. Compare `pmatch(rep("a", 100), rep("a", 100))` to `pmatch(rep("a", 101), rep("a", 101))`. – jblood94 Mar 23 '23 at 14:51
  • @jblood94 So far I do not know why that is the case, though the behaviour is quite striking. The only thing I have found so far is the note that [the target is not allowed to be long](https://github.com/wch/r-source/blob/d29fd8b7f3221aaef97f0980108a230623274442/src/main/unique.c#L1537) – Onyambu Mar 23 '23 at 15:06
  • Line 1603: https://github.com/wch/r-source/blob/d29fd8b7f3221aaef97f0980108a230623274442/src/main/unique.c#L1602 – jblood94 Mar 23 '23 at 15:07
  • @jblood94 That's right. I missed that. Haha. Yes, it shows that both need to be less than 100, otherwise `pmatch` would just change to `charmatch` and not allow multiple exact matches – Onyambu Mar 23 '23 at 15:10

If you want to define a simple function, you could do:

finddiff <- function(a, b) {
  levs <- unique(c(a, b))
  tab  <- table(factor(a, levs)) - table(factor(b, levs))
  tab  <- abs(tab[tab != 0])
  rep(names(tab), tab)
}

finddiff(a, b)
#> [1] "B" "C"
Allan Cameron
  • I thought about subtracting tables like that, nice approach. The only advantage `Reduce` has is that it preserves order, which I feared (without verification) that `table(.) - table(.)` would not. Nice use of `factor` in there, btw. – r2evans Mar 22 '23 at 23:32
  • Thanks @r2evans. The ordering in tables is based on factor levels, so we are guaranteed to get matching tables as long as we specify that a and b are factors with the same levels harvested from the unique values of both vectors. I never thought to use Reduce, but it's a neat idea too. – Allan Cameron Mar 22 '23 at 23:42
  • `finddiff` doesn't preserve order when the letters are not sorted, but the OP never stated that as a requirement, I assumed it (for the challenge). – r2evans Mar 22 '23 at 23:44
  • Maybe `tab[tab > 0]` instead of `abs(tab[tab != 0])` – GKi Mar 23 '23 at 09:54
  • @GKi I guess it depends on what you are trying to extract. `tab[tab > 0]` would get all elements of `a` not in `b`, i.e. **A - B** but `abs(tab[tab != 0])` gets all elements of `a` and `b` that are not part of the intersection, i.e. **A∪B - A∩B**. Both are the same in this example of course, so the OP's aim was open to interpretation. – Allan Cameron Mar 23 '23 at 10:25
  • I would interpret *I want to remove the elements* as `A - B`. But yes with: *Given that every element in my vector of elements I want to remove is also existing in the main vector i want to remove from.* it will not matter. – GKi Mar 23 '23 at 10:33
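
The distinction discussed in these comments only shows up when b contains something that a does not, which the question rules out. A minimal sketch with made-up vectors x and y (not from the question) illustrates the two readings:

```r
# Illustrative data: "D" occurs in y but not in x
x <- c("A", "B", "B", "C")
y <- c("B", "D")

levs <- unique(c(x, y))
tab  <- table(factor(x, levs)) - table(factor(y, levs))

# Reading 1: x minus y (only positive counts survive)
pos     <- tab[tab > 0]
minus   <- rep(names(pos), pos)

# Reading 2: everything that is not shared (positive and negative counts)
sym     <- abs(tab[tab != 0])
sym_dif <- rep(names(sym), sym)

print(minus)    # "A" "B" "C"
print(sym_dif)  # "A" "B" "C" "D"
```

With the vectors from the question both variants give the same result, so the choice only matters if b may contain extra values.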

Using a data.table anti-join with rowid:

library(data.table)
data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
#> [1] "B" "C"
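
Since the question asks for a solution without packages, the same idea (pair every value with its occurrence number, then drop the pairs that also appear in b) can be sketched in base R. This is only an illustration: the function names occurrence_id and multiset_diff are invented here, and ave() stands in for data.table's rowid().

```r
# Hypothetical helper: number the occurrences of each value (1st "B", 2nd "B", ...)
occurrence_id <- function(x) ave(seq_along(x), x, FUN = seq_along)

# Hypothetical helper: remove each element of b from a once per occurrence
multiset_diff <- function(a, b) {
  # Keys such as "1_B", "2_B"; the numeric prefix keeps them unambiguous
  key_a <- paste(occurrence_id(a), a, sep = "_")
  key_b <- paste(occurrence_id(b), b, sep = "_")
  a[!key_a %in% key_b]
}

a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")
multiset_diff(a, b)
multiset_diff(rev(a), b)
```

Because the result is a plain subset of a, the original order is preserved, and %in% is hash-based, so the sketch does not depend on pmatch.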

Testing it on larger vectors:

set.seed(2041082007)
a <- stringi::stri_rand_strings(2e5, 2)
b <- sample(a, 1e5)
system.time(ab1 <- data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]])
#>    user  system elapsed 
#>    0.00    0.02    0.01

Compare to the pmatch solution from Onyambu's answer:

system.time(ab2 <- a[-pmatch(b, a, 0)])
#>    user  system elapsed 
#>   46.53    0.00   46.56

Additionally, pmatch does not seem to behave correctly for this problem:

all.equal(ab1, ab2)
#> [1] "Lengths (100000, 196156) differ (string compare on first 100000)"
#> [2] "99979 string mismatches"

pmatch is returning a much larger vector than expected. Get the difference between the two answers:

ab12 <- data.table(ab2, rowid(ab2))[!data.table(ab1, rowid(ab1)), on = .(ab2 = ab1, V2)][[1]]

Check what is happening with the first element of ab12.

ab12[1]
#> [1] "28"
sum(a == ab12[1])
#> [1] 57
sum(b == ab12[1])
#> [1] 45

"28" appears 57 times in a and 45 times in b, so the result should have 12 instances of "28" as was returned by the anti-join.

sum(ab1 == ab12[1])
#> [1] 12

The pmatch solution, however, erroneously returns a vector that has 56 instances of "28".

sum(ab2 == ab12[1])
#> [1] 56
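
The single-value check above can be extended to all values at once. The following is a rough sketch of such a check in base R; the function name is_multiset_diff is an assumption made for this example, and it only compares counts, not the order of the result.

```r
# Hypothetical checker: does `result` contain each value exactly
# (count in a) - (count in b) times, floored at zero?
is_multiset_diff <- function(result, a, b) {
  levs  <- unique(c(a, b, result))
  count <- function(x) as.vector(table(factor(x, levels = levs)))
  expected <- pmax(count(a) - count(b), 0L)
  all(count(result) == expected)
}

a_small <- c("A", "B", "B", "C", "C", "C")
b_small <- c("A", "B", "C", "C")
is_multiset_diff(c("B", "C"), a_small, b_small)       # correct answer
is_multiset_diff(c("B", "B", "C"), a_small, b_small)  # one "B" too many
```

Calling it as is_multiset_diff(ab1, a, b) and is_multiset_diff(ab2, a, b) would be one way to compare the two solutions on the large vectors.
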
jblood94
  • `pmatch` will also match parts of a string, but this should not be a problem when, as given in the question, *Every element in my vector of elements I want to remove is also existing in the main vector I want to remove*. – GKi Mar 23 '23 at 12:27
  • At first I was thinking it had to do with partial matching, but it doesn't seem to be the case. It seems to have to do with the vector sizes. Try `set.seed(1); a <- sample(LETTERS, 1e3, 1); b <- sample(a, 5e2, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A")`. – jblood94 Mar 23 '23 at 12:36
  • @GKi, on the other hand, it seems to work ok for smaller vectors: `set.seed(1); a <- sample(LETTERS, 200, 1); b <- sample(a, 100, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A")`. – jblood94 Mar 23 '23 at 12:43
  • Yes you are right! `pmatch(rep("a", 100), rep("a", 100))` works in my case, while `pmatch(rep("a", 101), rep("a", 101))` does not. – GKi Mar 23 '23 at 12:52
  • Yep. 101 seems to be the transition point. I'm trying to find the .Internal code for `pmatch`, but I'm having a hard time. – jblood94 Mar 23 '23 at 13:06
  • Maybe because it "comes from" *argument matching* and typical there are less than 100 arguments...? – GKi Mar 23 '23 at 13:10
  • See line 1603 here: https://github.com/wch/r-source/blob/d29fd8b7f3221aaef97f0980108a230623274442/src/main/unique.c – jblood94 Mar 23 '23 at 13:48
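
Subtracting the count tables and expanding the result with tidyr::uncount (this assumes a and b contain the same set of distinct values, so that the two tables line up):

counts <- data.frame(table(a) - table(b))
tidyr::uncount(counts, Freq)$a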

Result

[1] B C
Levels: A B C
Jon Spring