6

I have data that shows what customers have purchased certain items. They can purchase an item multiple times. What I need is a table that shows all of the possible pairwise combinations of items along with the unique number of customers who have purchased that combination (the diagonal of the table will just be the unique number of people purchasing each item).

Here is an example:

item <- c("h","h","h","j","j")
customer <- c("a","a","b","b","b")
test.data <- data.frame(item,customer)

Here is the test.data:

item customer
h    a
h    a
h    b
j    b
j    b

Result needed - a table with the items as row and column names, with the counts of unique customers purchasing the pair inside the table. So, 2 customers purchased item h, 1 purchased both item h and j, and 1 purchased item j.

item   h    j
h      2    1
j      1    1

I have tried using the table function, melt/cast, etc., but nothing gets me the counts I need within the table. My first step is using unique() to get rid of duplicate rows.

ano
  • 704
  • 5
  • 15
user1228982
  • 105
  • 8
  • 2
    You are talking about 3 customers, but I see only 2 in your `test.data`. – Jaap Oct 06 '15 at 18:29
  • You won't be able to extend this beyond two items. Perhaps you should consider another way of representing it. – Frank Oct 06 '15 at 18:36
  • @Jaap, customer b purchased both item h and j, so they are included in all four combinations (the table reads "2 customers who purchased h also purchased h", "1 customer who purchased h also purchased j", and "1 customer who purchased j also purchased j" - yes, the diagonal value interpretations are weird, but that is what I need. – user1228982 Oct 06 '15 at 18:50
  • 2
    Possibly helpful [QA](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses); specifically something like `tcrossprod(table(test.data) > 0)` – alexis_laz Oct 07 '15 at 07:17
  • @alexis_laz Nice find. I think this is essentially a dupe. `crossprod(table(rev(unique(test.data))))` gives the OP's desired output. – Frank Oct 07 '15 at 14:18
  • If anyone disagrees about the dupe, feel free to reverse it or ping me in the comments. – Frank Oct 07 '15 at 14:29
  • @Frank, it does look like that is basically the same question, however the solution given runs into memory issues while Chris's solution runs relatively quickly, even when I just ran it on 700+ items. – user1228982 Oct 08 '15 at 19:10
  • Ok. Good to know. I think it's okay to leave it as a dupe even if the better answer is here. – Frank Oct 08 '15 at 19:21

2 Answers2

5

Using data.table and the gtools package, we can recreate all possible permutations by customer:

library(data.table)
library(gtools)

item <- c("h","h","h","j","j")
customer <- c("a","a","b","b","b")
test.data <- data.table(item,customer)

DT <- unique(test.data) #The unique is used as multiple purchases do not count twice

tuples <- function(x){
  return(data.frame(permutations(length(x), 2, x, repeats.allowed = T, set = F), stringsAsFactors = F))
}

DO <- DT[, tuples(item), by = customer]

This gives:

   customer X1 X2
1:        a  h  h
2:        b  h  h
3:        b  h  j
4:        b  j  h
5:        b  j  j

Which is a list of all unique item pairings a customer has. As per your example we are treating h x j differently from j x h. We can now get the frequency of each pair using the table function:

table(DO$X1,DO$X2)
    j h
  j 1 1
  h 1 2
Chris
  • 6,302
  • 1
  • 27
  • 54
  • @Chris, thanks for the response. This works great for the small example I supplied, however there are millions of rows in the original data set (though less than 100 unique items), so I'm not sure how well this will scale! – user1228982 Oct 06 '15 at 19:02
  • @user1228982 with 100 customers, 100 products and 1,000,000 rows, this runs in less than a second on my computer. Test using `item <- sample(1:100,1000000, replace = T)` `customer <- sample(1:100,1000000, replace = T)` – Chris Oct 06 '15 at 19:05
  • @user1228982 although it is slower than Frank's solution – Chris Oct 06 '15 at 19:29
  • running both solutions, yours was the only one to complete the task (in under 1 minute), while I left the other solution running for hours with no completion. Thanks for your help! – user1228982 Oct 07 '15 at 13:29
5

Here's a base R solution:

n_intersect <- Vectorize( function(x,y) length(intersect(x,y)) )

cs_by_item <- with(test.data, tapply(customer, item, unique))

outer(cs_by_item , cs_by_item , n_intersect)
#   h j
# h 2 1
# j 1 1
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Your solution worked great for a smaller data set, however when I attempted to use it with my full data, it never finished. – user1228982 Oct 07 '15 at 13:30
  • @user1228982 Yeah, I'm not surprised. The `outer` function often has problems like that, eating up a lot of memory. – Frank Oct 07 '15 at 14:12