1

I have data frame which represents if a customer bought certain items. The df looks as follows:

P1  P2  P3  P4  P5
1   2   0   0   0
1   1   0   0   0
0   0   0   3   0 
1   0   0   1   0
1   0   5   1   0
1   1   0   0   0

I am trying to create the occurrence of how many households purchased each item pair in the table. A snap shot of the result would like:

P1  P2  3
P1  P3  1
P1  P4  2
---------

As a first step, I converted the data in to a binary format - if the item was purchased. I am however, struggling with the function to convert it in to item pair group.

--- I am hoping to create a network graph using this data so maybe creating a matrix be of use too

Cœur
  • 37,241
  • 25
  • 195
  • 267
  • 1
    If the size of your data is an issue then you might think about mentioning this in your question next time. I deleted my answer due to your rude comment however, to answer your question, *yes* my solution (which came a half hour before any answer you liked) scales efficiently if you followed the embedded instructions to wrap it in a function. See `?apply`. – Hack-R Jun 17 '16 at 19:48
  • I apologize, wasn't my intention, thank you for your answer though – user3562074 Jun 17 '16 at 20:00
  • 1
    See, also, [this general QA](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses); `as.data.frame(as.table(crossprod(as.matrix(d) > 0)))` – alexis_laz Jun 17 '16 at 20:21

3 Answers3

3

Calling your data d:

d = structure(list(P1 = c(1L, 1L, 0L, 1L, 1L, 1L), P2 = c(2L, 1L, 
0L, 0L, 0L, 1L), P3 = c(0L, 0L, 0L, 0L, 5L, 0L), P4 = c(0L, 0L, 
3L, 1L, 1L, 0L), P5 = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("P1", 
"P2", "P3", "P4", "P5"), class = "data.frame", row.names = c(NA, 
-6L))

We can use combn:

pairs = combn(x = names(d), m = 2)
counts = combn(x = names(d), m = 2, FUN = function(x) sum(d[[x[1]]] > 0 & d[[x[2]]] > 0))

cbind.data.frame(t(pairs), counts)
#     1  2 counts
# 1  P1 P2      3
# 2  P1 P3      1
# 3  P1 P4      2
# 4  P1 P5      0
# 5  P2 P3      0
# 6  P2 P4      0
# 7  P2 P5      0
# 8  P3 P4      1
# 9  P3 P5      0
# 10 P4 P5      0
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
2

First make all the pairwise combinations of column names:

  res <- as.data.frame( t( combn(names(dat),2)))

Then do a rowwise summation of the presence in both columns of a sale:

    res$counts <- apply(res, 1, function(rw)  sum( 
                                   pmin( as.logical(dat[ ,rw[1] ]), #rw[1] is col name 
                                         as.logical(dat[ , rw[2] ]) # 2nd col name
                        )               )        )

> res
   V1 V2 counts
1  P1 P2      3
2  P1 P3      1
3  P1 P4      2
4  P1 P5      0
5  P2 P3      0
6  P2 P4      0
7  P2 P5      0
8  P3 P4      1
9  P3 P5      0
10 P4 P5      0
IRTFM
  • 258,963
  • 21
  • 364
  • 487
1

You can use combn() to build this:

as.data.frame(t(combn(names(df),2, function(x) list(x[1], x[2], sum((df[,x[1]]*df[,x[2]])!=0)))))
   V1 V2 V3
1  P1 P2  3
2  P1 P3  1
3  P1 P4  2
4  P1 P5  0
5  P2 P3  0
6  P2 P4  0
7  P2 P5  0
8  P3 P4  1
9  P3 P5  0
10 P4 P5  0
HubertL
  • 19,246
  • 3
  • 32
  • 51
  • Nice. You can also do `as.data.frame(t(combn(df, 2, function(x) c(names(x), sum(x[1] * x[2] != 0)))))`. – Psidom Jun 17 '16 at 19:36