84

I have two character vectors of IDs.

I would like to compare the two character vectors. In particular, I am interested in the following figures:

  • How many IDs are both in A and B
  • How many IDs are in A but not in B
  • How many IDs are in B but not in A

I would also love to draw a Venn diagram.

Nairolf
Aslan986
  • 7
    see `??intersect` and `??setdiff`... – agstudy Jul 11 '13 at 16:01
  • 2
    see [Venn Diagrams with R?](http://stackoverflow.com/q/1428946/59470) – topchef Jul 11 '13 at 16:37
  • 7
    isn't this an incorrect use of the term "list" in R? This is just two vectors. That's not the same at all. – emilBeBri Jan 30 '17 at 22:09
  • 2
    @Florian I agree "list" is wrong in R terms, but it is what the OP thought was right. If others have the same wrong idea and search from google, they could correctly land here. For this reason, I am usually conservative in correcting wrong terms in questions. Anyway, just something to maybe keep in mind if you are on an editing spree. (Btw, I use "set" in an answer below, because conceptually, that is how the vector is being treated here.) – Frank Jul 19 '19 at 16:49

7 Answers

135

Here are some basics to try out:

> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE  TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog"   "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion" 

Similarly, you could get counts simply as:

> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
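
The three counts asked for in the question can also be collected in one step. The following is only a minimal sketch: `compare_ids` is a hypothetical helper name, not part of base R, and it assumes that duplicated IDs should be counted once.

```r
# Hypothetical helper (not from base R): summarise the overlap of two ID vectors.
# Assumption: duplicated IDs are counted once, so both inputs are de-duplicated.
compare_ids <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  c(both   = length(intersect(a, b)),  # IDs in both a and b
    only_a = length(setdiff(a, b)),    # IDs in a but not in b
    only_b = length(setdiff(b, a)))    # IDs in b but not in a
}

A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")
counts <- compare_ids(A, B)
print(counts)
```

For the example vectors this gives 1 shared ID and 2 IDs unique to each side, matching the individual calls above.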
Mittenchops
22

I'm usually dealing with large-ish sets, so I use a table instead of a Venn diagram:

xtab_set <- function(A,B){
    both    <-  union(A,B)
    inA     <-  both %in% A
    inB     <-  both %in% B
    return(table(inA,inB))
}

set.seed(1)
A <- sample(letters[1:20],10,replace=TRUE)
B <- sample(letters[1:20],10,replace=TRUE)
xtab_set(A,B)

#        inB
# inA     FALSE TRUE
#   FALSE     0    5
#   TRUE      6    3
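
If the individual counts are needed as numbers, they can be read off the table by name. This is a sketch under assumptions: `xtab_set2` is a hypothetical variant of the function above, and the fixed factor levels are an addition so that both `FALSE` and `TRUE` always appear, even when one set is a subset of the other.

```r
# Hypothetical variant of xtab_set(); the explicit factor levels are an
# addition so that the FALSE and TRUE rows/columns always exist.
xtab_set2 <- function(A, B) {
  both <- union(A, B)
  table(inA = factor(both %in% A, levels = c(FALSE, TRUE)),
        inB = factor(both %in% B, levels = c(FALSE, TRUE)))
}

A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")
tab <- xtab_set2(A, B)

n_both   <- tab["TRUE", "TRUE"]    # in A and in B
n_only_a <- tab["TRUE", "FALSE"]   # in A but not in B
n_only_b <- tab["FALSE", "TRUE"]   # in B but not in A
```

The `["FALSE", "FALSE"]` cell is always 0, because every element of the union belongs to at least one of the two sets.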
Frank
  • Ah, I didn't realize Venn diagrams contained counts...I thought they were supposed to show the items themselves. – Frank Jul 11 '13 at 16:54
15

Yet another way is to use %in% and logical vectors of common elements instead of intersect and setdiff. I take it you actually want to compare two vectors, not two lists: a list is an R class that may contain elements of any type, while a vector always contains elements of just one type, which makes it easier to decide what is truly equal. Here the elements are coerced to character strings, since character is the most general of the element types present.

first <- c(1:3, letters[1:6], "foo", "bar")
second <- c(2:4, letters[5:8], "bar", "asd")

both <- first[first %in% second] # in both; same as intersect(first, second) when 'first' has no duplicates
onlyfirst <- first[!first %in% second] # only in 'first'; same as setdiff(first, second) when 'first' has no duplicates
onlysecond <- second[!second %in% first] # only in 'second'; same as setdiff(second, first) when 'second' has no duplicates
length(both)
length(onlyfirst)
length(onlysecond)

#> both
#[1] "2"   "3"   "e"   "f"   "bar"
#> onlyfirst
#[1] "1"   "a"   "b"   "c"   "d"   "foo"
#> onlysecond
#[1] "4"   "g"   "h"   "asd"
#> length(both)
#[1] 5
#> length(onlyfirst)
#[1] 6
#> length(onlysecond)
#[1] 4

# If you don't have the 'gplots' package, type: install.packages("gplots")
library(gplots) # unlike require(), stops with an error if the package is missing
venn(list(first.vector = first, second.vector = second))

As mentioned, there are multiple options for plotting Venn diagrams in R. Here is the output using gplots.

[Figure: Venn diagram of the two vectors, drawn with gplots]
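
Two details of the %in% approach are worth checking on small inputs: subsetting with %in% keeps duplicates, whereas intersect returns each value once, and mixing numbers with strings in c() coerces everything to character. A minimal sketch with made-up vectors (the names below are illustrative only):

```r
# Illustrative vectors, not taken from the answer above.
x <- c("a", "b", "b", "c")
y <- c("b", "c", "d")

kept_dups <- x[x %in% y]     # keeps the repeated "b"
as_set    <- intersect(x, y) # each common value appears once

# Combining integers and strings coerces the integers to character.
mixed <- c(1:2, "x")

print(kept_dups)
print(as_set)
print(mixed)
```

If counts of distinct IDs are wanted, applying unique() first (or using intersect and setdiff) avoids counting repeated IDs more than once.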

Teemu Daniel Laajala
4

With sqldf: Slower but very suitable for data frames with mixed types:

library(sqldf) # provides sqldf(); run install.packages("sqldf") if it is not installed
t1 <- as.data.frame(1:10)
t2 <- as.data.frame(5:15)
sqldf1 <- sqldf('SELECT * FROM t1 EXCEPT SELECT * FROM t2') # subset from t1 not in t2 
sqldf2 <- sqldf('SELECT * FROM t2 EXCEPT SELECT * FROM t1') # subset from t2 not in t1 
sqldf3 <- sqldf('SELECT * FROM t1 UNION SELECT * FROM t2') # UNION t1 and t2

sqldf1
  X1_10
1     1
2     2
3     3
4     4

sqldf2
  X5_15
1    11
2    12
3    13
4    14
5    15

sqldf3
   X1_10
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
Ronak Shah
rferrisx
3

Using the same example data as one of the answers above.

A = c("Dog", "Cat", "Mouse")
B = c("Tiger","Lion","Cat")

match(A,B)
[1] NA  3 NA

The match function returns, for each value in A, the position of its first match in B, or NA if there is none. So "Cat", the second element in A, is the third element in B. There are no other matches.

To get the matching values in A and B, you can do:

m <- match(A,B)
A[!is.na(m)]
"Cat"
B[m[!is.na(m)]]
"Cat"

To get the non-matching values in A and B:

A[is.na(m)]
"Dog"   "Mouse"
B[is.na(match(B, A))]
"Tiger" "Lion"

Further, you can use length() to get the total number of matching and non-matching values.
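
As a sketch of that last step, the counts can be taken directly from the NA pattern of match. This assumes neither vector contains duplicates; the variable names are illustrative.

```r
A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")

m <- match(A, B)                      # position in B of each element of A, NA if absent

n_matching <- sum(!is.na(m))          # elements of A that also occur in B
n_only_A   <- sum(is.na(m))           # elements of A with no match in B
n_only_B   <- sum(is.na(match(B, A))) # elements of B with no match in A

print(c(n_matching, n_only_A, n_only_B))
```

Note that the elements unique to B need a second call with the arguments swapped, since m is indexed by positions in A.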

milan
1

If A is a data.table with field a of type list, with entries themselves as vectors of a primitive type, e.g. created as follows

library(data.table)
A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))

and B is a list with vector of primitive entries, e.g. created as follows

B<-list(c("ghi","zyx"))

and you're attempting to find which (if any) element of A$a matches B

A[sapply(a,identical,unlist(B))]

if you just want the entry in a

A[sapply(a,identical,unlist(B)),a]

if you want the matching indices of a

A[,which(sapply(a,identical,unlist(B)))]

if instead B is itself a data.table with the same structure as A, e.g.

B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))

and you're looking for the intersection of the two lists by one column, where you require the same order of vector elements.

# give the entries in A for which A$a matches B$b
A[,`:=`(res=unlist(sapply(list(a),function(x,y){
                                      x %in% unlist(lapply(y,as.vector,mode="character"))
                                  },list(B[,b]),simplify=FALSE)))
  ][res==TRUE
  ][,res:=NULL][] 

# get T/F for each index of A
A[,sapply(list(a),function(x,y){
                      x %in% unlist(lapply(y,as.vector,mode="character"))
                  },list(B[,b]),simplify=FALSE)]

Note that you can't do something as easy as

setkey(A,a)
setkey(B,b)
A[B]

to join A&B because you cannot key on a field of type list in data.table 1.12.2

similarly, you cannot ask

A[a==B[,b]]

even if A and B are identical, as the == operator hasn't been implemented in R for type list

mpag
  • 1
    Base R + data.table doesn't have a function called `simplify`. Maybe put required library() calls before the other code? – Frank Jun 07 '19 at 15:04
  • 1
    thanks for the catch. looks like it's part of `purrr`, which is a component of hadley's tidyverse. It seems to just be a call to `unlist` in this context, so will replace – mpag Jun 07 '19 at 19:37
1

You can type help(sets) in the R console to see the documentation for the set operations in base R: union, intersection, (asymmetric!) difference, equality and membership on two vectors.

Examples from the documentation:

(x <- c(sort(sample(1:20, 9)), NA))
(y <- c(sort(sample(3:23, 7)), NA))
union(x, y)
intersect(x, y)
setdiff(x, y)
setdiff(y, x)
setequal(x, y)

## True for all possible x & y :
setequal( union(x, y),
          c(setdiff(x, y), intersect(x, y), setdiff(y, x)))

is.element(x, y) # length 10
is.element(y, x) # length  8
rez