2

I'm doing cross-sell analysis for several products with R. I've already transformed the transactional data and it looks like this -

  df.articles <- cbind.data.frame(Art01,Art02,Art03)

  Art01         Art02      Art03
  bread         yoghurt    egg
  butter        bread      yoghurt
  cheese        butter     bread
  egg           cheese     NA
  potato        NA         NA

 Actual data is 'data.frame': 69099 obs. of  33 variables.

I want to have the list of all distinct articles and their counts that was sold with an Article(say bread or yoghurt in this case) Actual data consists of 56 articles for which I need to check all the articles with which it was cross-sold. So the results that I would want to have has to be look like -

     Products sold with **bread**           Products sold with **Yoghurt**  

     yoghurt         2                        bread   2
     egg             1                        egg     1
     cheese          1                       butter   1
     butter          1          

     .... and list goes on like this for say 52 different articles. 

I've tried couple of things but it is too manual for this big dataset. It would be great to have this problem solved with the help of library(data.table), if not, that shall also be very fine. Thank you very much in advance.

Jaap
  • 81,064
  • 34
  • 182
  • 193
mukund
  • 553
  • 3
  • 15
  • 1
    Consider reformatting your data into a data.frame with a two-column structure such as data.frame(article=c(...), ingredient = c(...)). I think your current data.frame is rather inefficient – Damiano Fantini Aug 23 '17 at 17:02
  • 1
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Aug 23 '17 at 17:03

2 Answers2

5

There's...

library(data.table)
setDT(DF)
dat = setorder(melt(DF[, r := .I], id="r", na.rm=TRUE)[, !"variable"])
res = dat[, CJ(art = value, other_art = value), by=r][art != other_art, .N, keyby=.(art, other_art)]

        art other_art N
 1:   bread    butter 2
 2:   bread    cheese 1
 3:   bread       egg 1
 4:   bread   yoghurt 2
 5:  butter     bread 2
 6:  butter    cheese 1
 7:  butter   yoghurt 1
 8:  cheese     bread 1
 9:  cheese    butter 1
10:  cheese       egg 1
11:     egg     bread 1
12:     egg    cheese 1
13:     egg   yoghurt 1
14: yoghurt     bread 2
15: yoghurt    butter 1
16: yoghurt       egg 1

Comment. The OP mentions having 56 distinct items, which means a single order (r above) could have as many as 3136 = 56^2 rows after CJ. With a few thousand orders, this rapidly becomes problematic. This is typical when doing combinatorial computations, so hopefully this task is just for browsing the data and not analysing it.

Another idea when browsing, would be to use split and lapply to customize the display:

library(magrittr)
split(res, by="art", keep.by = FALSE) %>% lapply(. %$% setNames(N, other_art))

$bread
 butter  cheese     egg yoghurt 
      2       1       1       2 

$butter
  bread  cheese yoghurt 
      2       1       1 

$cheese
 bread butter    egg 
     1      1      1 

$egg
  bread  cheese yoghurt 
      1       1       1 

$yoghurt
 bread butter    egg 
     2      1      1 

I usually just explore with res[art == "bread"], res[art == "bread" & other_art == "butter"], etc, though, as @ycw suggested in a comment.

Magrittr isn't really needed here; it just allows for different syntax.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    I like this data frame output. It will be easy to filter the data frame to see the count (`N`) of each article combination. – www Aug 23 '17 at 17:31
  • 1
    @ycw Thanks. There's also `split(res, by="art", keep.by = FALSE) %>% lapply(. %$% setNames(N, other_art))` with magrittr to format the output. – Frank Aug 23 '17 at 17:37
  • 1
    @Frank : Thank you very very much :) This works perfectly. I don't have the words to thank you enough! – mukund Aug 23 '17 at 19:17
1

Here is an option. We can use some functions from tidyverse to create a list with results. The a_list4 is the final output. Each element is an article with numbers of associated articles.

# Prepare the data frame "dt"
dt <- read.table(text = "Art01         Art02      Art03
  bread         yoghurt    egg
                 butter        bread      yoghurt
                 cheese        butter     bread
                 egg           cheese     NA
                 potato        NA         NA",
                 header = TRUE, stringsAsFactors = FALSE)

# Load package
library(tidyverse)

# A vector with articles
articles <- unique(unlist(dt))

# Remove NA
articles <- articles[!is.na(articles)]

# A function to filter the data frame by articles
filter_fun <- function(article, dt){
  dt2 <- dt %>% filter(rowSums(. == article) > 0)
  return(dt2)
}

# Apply the filter_fun
a_list <- map(articles, filter_fun, dt = dt)
names(a_list) <- articles

# Get articles in each element of the list
a_list2 <- map(a_list, function(dt) unlist(dt))

# Remove the articles based on the name of that article
a_list3 <- map2(a_list2, names(a_list2), function(vec, article){
  vec[!(vec %in% article)]
})

# Count the number
a_list4 <- map(a_list3, table)

# See the results
a_list4

$bread

 butter  cheese     egg yoghurt 
      2       1       1       2 

$butter

  bread  cheese yoghurt 
      2       1       1 

$cheese

 bread butter 
     1      1 

$egg

  bread yoghurt 
      1       1 

$potato
< table of extent 0 >

$yoghurt

 bread butter    egg 
     2      1      1 
www
  • 38,575
  • 12
  • 48
  • 84
  • This has the benefit vs my approach of including an entry for potato even though it has no pairings. – Frank Aug 23 '17 at 17:41
  • @Frank Thanks for sharing your approach to create the list and comparing different approaches! – www Aug 23 '17 at 17:43
  • @ycw : Thank you for sharing this. I would definitely explore the 'tidyverse' lib. Thank again. – mukund Aug 23 '17 at 19:19