0

I have a data set of users and products

user    product
a       X
a       Y
a       Z
b       Z
c       Y
c       Z
...

from which I'd like a list of all pairwise combinations of products and the frequency of users that have that particular combination, e.g.

product_1    product_2     N
X            Y             1          <- user a
X            Z             1          <- user a
Y            Z             2          <- users a,c
...

So far, I've successfully made a list of all product combinations

data = data.frame(users=c('a','a','a','b','c','c'), products = c("X","Y","Z","Z","Y","Z"))
combinations = t(combn(unique(data$products), 2))

but when I implemented an ugly for-loop to iterate through the combinations and fetch info from the original data set it just wouldn't ever finish (actual data is about 1k products and 500k users, 25m rows in total).

What strategy would you use to implement this?

EDIT

Added a column of data to make things clearer. The result should count the number of users that have both product_1 and product_2 (even if they have other products as well).

jenswirf
  • 7,087
  • 11
  • 45
  • 65
  • What are you supposed to do in this `for` loop? I usually use `break` or `next` to speed a little the code in this loops. Another option is use the [apply](http://stackoverflow.com/a/7141669/2886003) functions that avoid the for loop – llrs May 27 '14 at 14:23
  • 1
    Show the output you'd get given your example data. – Simon O'Hanlon May 27 '14 at 14:25

2 Answers2

2

Using @Roland's DF object, you can also do

tt<-with(DF, table(user,product))
tt<-crossprod(tt)
tt[upper.tri(tt, T)]<-0
as.data.frame(as.table(tt))

which will give you the counts for all combinations, even the 0's.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • +1 nice use of `crossprod`, but most likely `table` will choke on these combinations. – Arun May 27 '14 at 14:57
  • Good point @Arun. And i realize it counts X,Y differently from Y,X (though i zero out the backwards counts). I guess i was trying to be too fancy. – MrFlick May 27 '14 at 15:08
1
DF <- read.table(text="user    product
a       X
a       Y
a       Z
b       Z
c       Y
c       Z", header=TRUE)

res <- combn(as.character(unique(DF$product[!duplicated(DF)])), 2, function(i, DF) {
  tmp <- DF[DF$product %in% i, "user"]
  cbind.data.frame(t(i), freq=sum(duplicated(tmp)))
}, DF=DF, simplify=FALSE)

do.call(rbind, res)
#  1 2 freq
#1 X Y    1
#2 X Z    1
#3 Y Z    2

This is probably not the most efficient solution. For each combination of products it takes the subset of the data and then counts duplicated users.

Roland
  • 127,288
  • 10
  • 191
  • 288