0

I have a dataframe/tibble of 2 columns - customerID, productDescription. I want to keep a tidy dataset but find all pairs of items purchased for each customer.

My goal is to then create a tilemap displaying the frequency of purchase or each product combination. How would this be possible given a sample dataset such as below?

Please note I do not want a column separating each purchase by comma - I only want 2 products showing as pairs. These pairs will be made up of all possible combinations of products bought by a unique ID

Data:

> dput(df)
structure(list(ID = structure(c(3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 
2L), .Label = c("abc", "def", "xyz"), class = "factor"), Product = structure(c(1L, 
3L, 4L, 5L, 1L, 2L, 4L, 2L, 3L), .Label = c("Product1", "Product3", 
"Product4", "Product5", "Product7"), class = "factor")), .Names = c("ID", 
"Product"), row.names = c(NA, -9L), class = "data.frame")

> df
   ID  Product
1 xyz Product1
2 xyz Product4
3 xyz Product5
4 xyz Product7
5 abc Product1
6 abc Product3
7 abc Product5
8 def Product3
9 def Product4

What I would like to see:

Data:

 > df
      ID  Product  ProductPair
    1 xyz Product1 Product1
    2 xyz Product1 Product4
    3 xyz Product1 Product5
    4 xyz Product1 Product7 
    5 xyz Product4 Product1
    6 xyz Product4 Product4
    7 xyz Product4 Product5
    8 xyz Product4 Product7
    9 xyz Product5 Product1
    10 xyz Product5 Product4
    11 xyz Product5 Product5
    12 xyz Product5 Product7
    13 xyz Product7 Product1
    14 xyz Product7 Product4
    15 xyz Product7 Product5
    16 xyz Product7 Product7

Note that here we have 4 products bought by xyz, and a combination of 2 gives us 4^2 possible combinations of products.

Chabo
  • 2,842
  • 3
  • 17
  • 32
  • What have you tried so far? Please share your code. Sample data is best provided using `dput()`, so users do not have to create your data by themselves. – Wimpel Mar 11 '19 at 19:05
  • I have tried using mutate to duplicate the column but I still haven't been able to get product pairs. I know I could use spread to create an individual column for each product but not sure if that would help – Brian Curry Mar 11 '19 at 19:21
  • 1
    please make your question reproducible: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Wimpel Mar 11 '19 at 19:23
  • `aggregate(Product ~ ID, data = df, toString)` This is what you are looking for, but its also a dup from here. https://stackoverflow.com/questions/16596515/aggregating-by-unique-identifier-and-concatenating-related-values-into-a-string – Chabo Mar 11 '19 at 20:05
  • Please edit your question to show an expected output when you get the chance – Chabo Mar 11 '19 at 20:41
  • Edited to show output now – Brian Curry Mar 11 '19 at 20:59
  • This question was incorrectly marked as a duplicate. – Chabo Mar 12 '19 at 15:19

1 Answers1

0

There probably is a better way to do this but (Note this is only for ID=xyz):

#Look for instances of xyz
logical_xyz <-df$ID=="xyz" 
#Subset based on logical
xyz_s<-df[logical_xyz,]
#Find unique products, get length
l_xyz<-nrow(unique(xyz_s))
#Use length to replicate for # of instances needed
xyz_fin<-xyz_s[rep(seq_len(nrow(xyz_s)), each=l_xyz),]

#Assign product matches by replicating original
xyz_fin[,3]<-rep(unique(xyz_s$Product),l_xyz)
rownames(xyz_fin) <- NULL
names(xyz_fin)<-c("ID","Product","Product_Match")

> xyz_fin
    ID  Product Product_Match
1  xyz Product1      Product1
2  xyz Product1      Product4
3  xyz Product1      Product5
4  xyz Product1      Product7
5  xyz Product4      Product1
6  xyz Product4      Product4
7  xyz Product4      Product5
8  xyz Product4      Product7
9  xyz Product5      Product1
10 xyz Product5      Product4
11 xyz Product5      Product5
12 xyz Product5      Product7
13 xyz Product7      Product1
14 xyz Product7      Product4
15 xyz Product7      Product5
16 xyz Product7      Product7

Edit: Using a loop to get all options

IDs<-unique(df$ID)
Fin_List<-list()

for(i in 1:length(IDs)){
    logical<-df$ID==IDs[i] 
    s<-df[logical,]
    len<-nrow(unique(s))
    fin<-s[rep(seq_len(nrow(s)), each=len),]
    fin[,3]<-rep(unique(s$Product),len)
    rownames(fin) <- NULL
    names(fin)<-c("ID","Product","Product_Match")
    Fin_List[[i]]<-fin
}

> Fin_List
[[1]]
    ID  Product Product_Match
1  xyz Product1      Product1
2  xyz Product1      Product4
3  xyz Product1      Product5
4  xyz Product1      Product7
5  xyz Product4      Product1
6  xyz Product4      Product4
7  xyz Product4      Product5
8  xyz Product4      Product7
9  xyz Product5      Product1
10 xyz Product5      Product4
11 xyz Product5      Product5
12 xyz Product5      Product7
13 xyz Product7      Product1
14 xyz Product7      Product4
15 xyz Product7      Product5
16 xyz Product7      Product7

[[2]]
   ID  Product Product_Match
1 abc Product1      Product1
2 abc Product1      Product3
3 abc Product1      Product5
4 abc Product3      Product1
5 abc Product3      Product3
6 abc Product3      Product5
7 abc Product5      Product1
8 abc Product5      Product3
9 abc Product5      Product5

[[3]]
   ID  Product Product_Match
1 def Product3      Product3
2 def Product3      Product4
3 def Product4      Product3
4 def Product4      Product4
Chabo
  • 2,842
  • 3
  • 17
  • 32
  • That would be good but I am trying to get it for all IDs if possible - I guess I could try to loop through each distinct ID? – Brian Curry Mar 11 '19 at 21:37
  • @BrianCurry Yeah that would be possible, you could gather all unique ID's in a list and loop through the list using each. The problem would be is I have no idea what would happen with this if they ID's occur at separate (non-continous) points in the data set. – Chabo Mar 11 '19 at 21:39
  • @BrianCurry I have edited to include loop. These listed data frames could be bound into one, and then you could search/count for all instances of combinations between `Product` and `Product_Match` which could eventually give you the graph. That might be suited for another question. – Chabo Mar 11 '19 at 21:48
  • 1
    This did end up working! Given the size of my dataset it took plenty long to run and concatenate to a dataframe but it ultimately did work! Thank you for the help – Brian Curry Mar 13 '19 at 11:58