Removing Elements from vector the amount of time it occurs in R

Question

I want to remove elements from a vector as many times as they occur in another vector, as if I were subtracting one from the other. Every element in the vector of elements to remove also exists in the main vector I want to remove from.

a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

a[! a %in% b] #returns character(0)

#expected result = "B" "C"

I don't want to use a library for this. I'd rather write a function, if possible without loops. Is there a way to do so? Thank you in advance.

Very possibly a duplicate of ["Set Difference" between two vectors with duplicate values](https://stackoverflow.com/questions/52941312/set-difference-between-two-vectors-with-duplicate-values) — thelatemail, Mar 23 '23 at 00:19

r2evans · Answer 1 · 2023-03-23T15:22:59.550

3

This may not be the most efficient approach, but it works:

Reduce(function(prev, this) {
  ind <- match(this, prev)  # position of the first match, or NA if there is none
  if (is.na(ind)) prev else prev[-ind]
}, b, init = a)
# [1] "B" "C"
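
To see how the `Reduce` call whittles `a` down one element of `b` at a time, the intermediate states can be kept with `accumulate = TRUE`. This is only an illustrative sketch: `drop_first` is a made-up name for the same kind of step function, written so that `prev` is returned unchanged when there is no match.

```r
a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

# Hypothetical helper: drop the first occurrence of `this` from `prev`, if any
drop_first <- function(prev, this) {
  ind <- match(this, prev)
  if (is.na(ind)) prev else prev[-ind]
}

# accumulate = TRUE keeps every intermediate vector, starting with `a` itself
steps <- Reduce(drop_first, b, init = a, accumulate = TRUE)
steps[[length(steps)]]  # the final state is the answer
```

Each entry of `steps` is one element shorter than the previous one, which makes the one-removal-per-occurrence behaviour easy to inspect.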

For fun, here's a non-Reduce variant (motivated by looking at AllanCameron's simpler answer) that preserves order. The added complexity is only worth it if preserving order is necessary.

finddiff2 <- function(A, B) {
  dict <- split(seq_along(A), A)
  tb <- table(B)
  nms <- intersect(names(tb), A)
  dict[nms] <- Map(tail, dict[nms], -tb[nms])
  A[sort(unlist(dict))]
}
finddiff2(a, b)
# [1] "B" "C"
finddiff2(rev(a), b)
# [1] "C" "B"
finddiff2(c("A","B"), "A")
# [1] "B"

Order preservation is easier to see with a longer `a`:

a <- rep(c("A","B","C"), times = 4)
finddiff2(a, b)
# [1] "A" "B" "A" "B" "C" "A" "B" "C"
finddiff2(rev(a), b)
# [1] "B" "A" "C" "B" "A" "C" "B" "A"
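
The two building blocks of `finddiff2` may be easier to follow in isolation. The sketch below, using the original short vector, shows the position dictionary built by `split` and how `tail` with a negative count drops the first positions of a group.

```r
A <- c("A", "B", "B", "C", "C", "C")

# Map each distinct value to the positions where it occurs in A
dict <- split(seq_along(A), A)
dict$B  # positions 2 and 3
dict$C  # positions 4, 5 and 6

# A negative n in tail() drops the first n elements, so removing
# two "C"s leaves only the last position
keep_C <- tail(dict$C, -2)
A[keep_C]
```

Because the surviving positions are sorted before indexing `A`, the remaining elements come out in their original order.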

edited Mar 23 '23 at 15:22

answered Mar 22 '23 at 23:26

r2evans

Ah, I get what you mean about the preserved ordering now. Thanks for the addition. – Allan Cameron Mar 22 '23 at 23:44
Try: `finddiff2(c("A", "B"), "A")` – GKi Mar 23 '23 at 13:33
Good find, fixed @GKi – r2evans Mar 23 '23 at 13:36
Maybe `A[sort(unlist(dict))]` instead of `rep(names(dict), lengths(dict))[order(unlist(dict))]`? That also keeps the type (e.g. it will work with integers and will not convert them to character). – GKi Mar 23 '23 at 13:44
@GKi, that's a great point and recommendation, thanks. – r2evans Mar 23 '23 at 15:23

Onyambu · Answer 2 · 2023-03-23T15:14:40.487

2

In base R you could use `pmatch`:

a[-pmatch(b, a, 0)]
# [1] "B" "C"

Note that the 0 above is needed in case there is a value in b that does not exist in a.
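
A small sketch of what the `0` does. The vector `b2` is a made-up example containing a value that is absent from `a`; zeros are ignored when they are mixed with negative subscripts, so the unmatched value simply removes nothing.

```r
a  <- c("A", "B", "B", "C", "C", "C")
b2 <- c("A", "Z")            # "Z" does not occur in a (assumed example)

idx <- pmatch(b2, a, 0)      # third argument is nomatch: unmatched values give 0
res <- a[-idx]               # the 0 is ignored, only position 1 is dropped
res
```

One caveat with this idiom: if no element matched at all, `idx` would consist only of zeros and `a[-idx]` would return an empty vector rather than `a`.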

If all the elements in b are in a, then the following is sufficient:

a[-pmatch(b, a)]
# [1] "B" "C"

NB: as @jblood94 pointed out, `pmatch` only gives correct results here when `length(a) < 101 | length(b) < 101`; for longer vectors the result can be incorrect.

edited Mar 23 '23 at 15:14

answered Mar 23 '23 at 01:03

Onyambu

This should come with a strong caveat of `length(a) < 101 | length(b) < 101`. Otherwise the result will be incorrect. Compare `pmatch(rep("a", 100), rep("a", 100))` to `pmatch(rep("a", 101), rep("a", 101))`. – jblood94 Mar 23 '23 at 14:51
@jblood94 So far I do not know why that is the case, though the behaviour is quite striking. The only thing I have found so far is the notion that [the target is not allowed to be long](https://github.com/wch/r-source/blob/d29fd8b7f3221aaef97f0980108a230623274442/src/main/unique.c#L1537) – Onyambu Mar 23 '23 at 15:06
Line 1603: https://github.com/wch/r-source/blob/d29fd8b7f3221aaef97f0980108a230623274442/src/main/unique.c#L1602 – jblood94 Mar 23 '23 at 15:07
@jblood94 That's right, I missed that. Yes, it shows that both need to be less than 100, otherwise `pmatch` would just change to `charmatch` and not allow multiple exact matches – Onyambu Mar 23 '23 at 15:10

score 1 · Answer 3 · answered Mar 22 '23 at 23:31

If you want to define a simple function, you could do:

finddiff <- function(a, b) {
  levs <- unique(c(a, b))
  tab  <- table(factor(a, levs)) - table(factor(b, levs))
  tab  <- abs(tab[tab != 0])
  rep(names(tab), tab)
}

finddiff(a, b)
#> [1] "B" "C"
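
To make the table arithmetic concrete, here is a sketch of the intermediate result for the vectors from the question. Giving both factors the same levels is what lines the two count tables up so they can be subtracted element by element.

```r
a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

levs  <- unique(c(a, b))              # shared levels: "A" "B" "C"
cnt_a <- table(factor(a, levs))       # counts 1, 2, 3
cnt_b <- table(factor(b, levs))       # counts 1, 1, 2
tab   <- cnt_a - cnt_b                # remaining counts 0, 1, 1
tab
```

The final `rep(names(tab), tab)` step in `finddiff` then repeats each name by its remaining count.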

answered Mar 22 '23 at 23:31

Allan Cameron

I thought about subtracting tables like that, nice approach. The only advantage `Reduce` has is that it preserves order, which I feared (without verification) that `table(.) - table(.)` would not. Nice use of `factor` in there, btw. – r2evans Mar 22 '23 at 23:32
Thanks @r2evans. The ordering in tables is based on factor levels, so we are guaranteed to get matching tables as long as we specify that a and b are factors with the same levels harvested from the unique values of both vectors. I never thought to use Reduce, but it's a neat idea too. – Allan Cameron Mar 22 '23 at 23:42
`finddiff` doesn't preserve order when the letters are not sorted, but the OP never stated that as a requirement, I assumed it (for the challenge). – r2evans Mar 22 '23 at 23:44
Maybe `tab[tab > 0]` instead of `abs(tab[tab != 0])` – GKi Mar 23 '23 at 09:54
@GKi I guess it depends on what you are trying to extract. `tab[tab > 0]` would get all elements of `a` not in `b`, i.e. **A - B** but `abs(tab[tab != 0])` gets all elements of `a` and `b` that are not part of the intersection, i.e. **A∪B - A∩B**. Both are the same in this example of course, so the OP's aim was open to interpretation. – Allan Cameron Mar 23 '23 at 10:25
I would interpret *I want to remove the elements* as `A - B`. But yes with: *Given that every element in my vector of elements I want to remove is also existing in the main vector i want to remove from.* it will not matter. – GKi Mar 23 '23 at 10:33
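
The difference between the two variants discussed in these comments only shows up when `b` contains something that `a` lacks. A sketch with made-up vectors `a2` and `b2` (not from the question):

```r
a2 <- c("A", "B", "B")
b2 <- c("B", "D")                     # "D" is not in a2 (assumed example)

levs <- unique(c(a2, b2))
tab  <- table(factor(a2, levs)) - table(factor(b2, levs))  # A 1, B 1, D -1

# A - B: keep only what is left over from a2
pos <- tab[tab > 0]
only_a <- rep(names(pos), pos)

# Symmetric variant: everything outside the intersection
nz <- abs(tab[tab != 0])
sym <- rep(names(nz), nz)

only_a
sym
```

With the guarantee stated in the question (every element of `b` also exists in `a`, at least as often), the two variants give the same result.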

jblood94 · Answer 4 · 2023-03-23T12:42:25.357

1

Using a data.table anti-join with rowid:

library(data.table)
data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
#> [1] "B" "C"
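
Since the question asks for a solution without packages, the same idea (pair every value with its occurrence number, then drop the pairs that also appear in `b`) can be sketched in base R. This is not the code from this answer but an analogue under stated assumptions: `occ` is a hypothetical helper standing in for `rowid`, and the keys are built with `paste`, which assumes the values never contain the separator character.

```r
a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

# Hypothetical helper: running occurrence number of each value (1, 2, 3, ...)
occ <- function(x) ave(seq_along(x), x, FUN = seq_along)

# Build (value, occurrence) keys; "\r" is an assumed separator that must
# not appear in the data
key_a <- paste(a, occ(a), sep = "\r")
key_b <- paste(b, occ(b), sep = "\r")

# Anti-join: keep the elements of a whose key has no partner in b
res <- a[!key_a %in% key_b]
res
```

Like the anti-join, this keeps the later occurrences of each value and preserves the order of `a`.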

Testing it on larger vectors:

set.seed(2041082007)
a <- stringi::stri_rand_strings(2e5, 2)
b <- sample(a, 1e5)
system.time(ab1 <- data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]])
#>    user  system elapsed 
#>    0.00    0.02    0.01

Compare to the pmatch solution from Onyambu's answer:

system.time(ab2 <- a[-pmatch(b, a, 0)])
#>    user  system elapsed 
#>   46.53    0.00   46.56

Additionally, pmatch does not seem to behave correctly for this problem:

all.equal(ab1, ab2)
#> [1] "Lengths (100000, 196156) differ (string compare on first 100000)"
#> [2] "99979 string mismatches"

pmatch is returning a much larger vector than expected. Get the difference between the two answers:

ab12 <- data.table(ab2, rowid(ab2))[!data.table(ab1, rowid(ab1)), on = .(ab2 = ab1, V2)][[1]]

Check what is happening with the first element of ab12.

ab12[1]
#> [1] "28"
sum(a == ab12[1])
#> [1] 57
sum(b == ab12[1])
#> [1] 45

"28" appears 57 times in a and 45 times in b, so the result should have 12 instances of "28" as was returned by the anti-join.

sum(ab1 == ab12[1])
#> [1] 12

The pmatch solution, however, erroneously returns a vector that has 56 instances of "28".

sum(ab2 == ab12[1])
#> [1] 56
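
The per-value check done by hand above can be generalised. The sketch below uses a hypothetical helper, `check_multiset_diff`, that compares the counts in a candidate result against the counts in `a` minus the counts in `b`, clamped at zero; it is meant as a quick sanity check, not as part of either solution.

```r
# Hypothetical helper: TRUE if `res` contains each value exactly
# max(count in a - count in b, 0) times
check_multiset_diff <- function(a, b, res) {
  levs <- unique(c(a, b, res))
  cnt  <- function(x) as.vector(table(factor(x, levs)))
  all(cnt(res) == pmax(cnt(a) - cnt(b), 0))
}

a_small <- c("A", "B", "B", "C", "C", "C")
b_small <- c("A", "B", "C", "C")

ok  <- check_multiset_diff(a_small, b_small, c("B", "C"))       # correct result
bad <- check_multiset_diff(a_small, b_small, c("B", "C", "C"))  # one "C" too many
```

Running such a check on the results of the larger benchmark would flag a result with surplus elements, such as the extra instances of "28" found above.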

edited Mar 23 '23 at 12:42

answered Mar 23 '23 at 12:15

jblood94

`pmatch` will also match parts of a string, but this should not be a problem when, as stated in the question, *every element in my vector of elements I want to remove is also existing in the main vector I want to remove*. – GKi Mar 23 '23 at 12:27
At first I was thinking it had to do with partial matching, but it doesn't seem to be the case. It seems to have to do with the vector sizes. Try `set.seed(1); a <- sample(LETTERS, 1e3, 1); b <- sample(a, 5e2, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A")`. – jblood94 Mar 23 '23 at 12:36
@GKi, on the other hand, it seems to work ok for smaller vectors: `set.seed(1); a <- sample(LETTERS, 200, 1); b <- sample(a, 100, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A")`. – jblood94 Mar 23 '23 at 12:43
Yes you are right! `pmatch(rep("a", 100), rep("a", 100))` works in my case, while `pmatch(rep("a", 101), rep("a", 101))` does not. – GKi Mar 23 '23 at 12:52
Yep. 101 seems to be the transition point. I'm trying to find the .Internal code for `pmatch`, but I'm having a hard time. – jblood94 Mar 23 '23 at 13:06
Maybe because it "comes from" *argument matching*, and typically there are fewer than 100 arguments...? – GKi Mar 23 '23 at 13:10
See line 1603 here: https://github.com/wch/r-source/blob/d29fd8b7f3221aaef97f0980108a230623274442/src/main/unique.c – jblood94 Mar 23 '23 at 13:48

score 0 · Answer 5 · answered Mar 23 '23 at 00:39

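# Subtract the per-value counts (assumes a and b contain the same set of values)
counts <- data.frame(table(a) - table(b))
# Repeat each value by its remaining count; requires the tidyr package
tidyr::uncount(counts, Freq)$a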

Result

[1] B C
Levels: A B C

answered Mar 23 '23 at 00:39

Jon Spring
