
I have a data.frame that looks like this:

> DF1      

 A    B    C    D    E     
 a    x    c    h    p 
 c    d    q    t    w
 s    e    r    p    a
 w    l    t    s    i
 p    i    y    a    f
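
For anyone who wants to run the examples, here is a minimal sketch that rebuilds the data shown above. It assumes the columns are plain character vectors rather than factors, which the printed table does not confirm.

```r
# Assumed reconstruction of DF1 from the printed table; columns are kept
# as character vectors (stringsAsFactors = FALSE), not factors.
DF1 <- data.frame(
  A = c("a", "c", "s", "w", "p"),
  B = c("x", "d", "e", "l", "i"),
  C = c("c", "q", "r", "t", "y"),
  D = c("h", "t", "p", "s", "a"),
  E = c("p", "w", "a", "i", "f"),
  stringsAsFactors = FALSE
)

# Hand check of a single pair: the elements shared by columns A and E,
# and how many there are.
shared_AE <- intersect(DF1$A, DF1$E)
length(shared_AE)
```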

I would like to compare each column of my data.frame with the remaining columns in order to count the number of common elements. For example, I would like to compare column A with all the remaining columns (B, C, D, E) and count the common entities in this way:

A versus the remaining:

  • A vs B: 0 (because they have 0 common elements)
  • A vs C: 1 (c in common)
  • A vs D: 3 (a, p and s in common)
  • A vs E: 3 (p, w and a in common)

Then the same: B versus C,D,E columns and so on.

How can I implement this?

halfer
Bfu38

2 Answers


We can loop over the column names, intersect each column with every other column, and take the length of each intersection. The diagonal (a column compared with itself) is filled with 0.

sapply(names(DF1), function(x) {
    x1 <- lengths(Map(intersect, DF1[setdiff(names(DF1), x)], DF1[x]))
    c(x1, setNames(0, setdiff(names(DF1), names(x1))))[names(DF1)]})
#  A B C D E
#A 0 0 1 3 3
#B 0 0 0 0 1
#C 1 0 0 1 0
#D 3 0 1 0 2 
#E 3 1 0 2 0

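Alternatively, this can be done more compactly: reshape the data to long format (melt), tabulate how often each value occurs in each column, and take the cross product of that table with itself. Multiplying by !diag(5) zeroes the diagonal.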

library(reshape2)
tcrossprod(table(melt(as.matrix(DF1))[-1])) * !diag(5)
#    Var2
#Var2 A B C D E
#   A 0 0 1 3 3
#   B 0 0 0 0 1
#   C 1 0 0 1 0
#   D 3 0 1 0 2
#   E 3 1 0 2 0

NOTE: The crossprod step can also be implemented with RcppEigen, which would make this faster.
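
To make the cross-product idea concrete, here is a minimal base-R sketch of the same computation without reshape2. It is an illustration rather than the code from the answer above: the names m, inc and common are made up for the example, and DF1 is rebuilt on the assumption that its columns are character vectors.

```r
# Assumed reconstruction of DF1 (character columns).
DF1 <- data.frame(
  A = c("a", "c", "s", "w", "p"),
  B = c("x", "d", "e", "l", "i"),
  C = c("c", "q", "r", "t", "y"),
  D = c("h", "t", "p", "s", "a"),
  E = c("p", "w", "a", "i", "f"),
  stringsAsFactors = FALSE
)

m <- as.matrix(DF1)

# Incidence table: one row per column of DF1, one column per distinct value.
# colnames(m)[col(m)] labels every cell with the name of its column.
inc <- table(column = colnames(m)[col(m)], value = as.vector(m))

# Cap counts at 1 so a value repeated within one column is counted once
# (no such repeats occur in this data).
inc[inc > 0] <- 1L

# inc %*% t(inc): entry [i, j] is the number of values present in both
# column i and column j.
common <- tcrossprod(inc)
diag(common) <- 0  # a column is not compared with itself
common
```

Each row of inc is a 0/1 indicator vector over the distinct values, so the dot product of two rows counts the values that appear in both columns.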

akrun

An alternative is to use combn twice: once to get the pairs of column names, and once to compute the size of the intersection for each pair.

cbind.data.frame combines the two results into a data.frame, and setNames adds the column names.

setNames(cbind.data.frame(t(combn(names(DF1), 2)),
                 combn(names(DF1), 2, function(x) length(intersect(DF1[, x[1]], DF1[, x[2]])))),
         c("col1", "col2", "count"))
   col1 col2 count
1     A    B     0
2     A    C     1
3     A    D     3
4     A    E     3
5     B    C     0
6     B    D     0
7     B    E     1
8     C    D     1
9     C    E     0
10    D    E     2
lmo