
I have been trying to do a data analysis exercise in R with data about sales. The data frame looks like:

   Order_ID Product                 
      <dbl> <chr>                   
 1   319631 34in Ultrawide Monitor  
 2   319631 Lightning Charging Cable
 3   319596 iPhone                  
 4   319596 Lightning Charging Cable
 5   319584 iPhone                  
 6   319584 Wired Headphones        
 7   319556 Google Phone            
 8   319556 Wired Headphones

I have to find which products are bought together most often. Order_ID has duplicates, i.e. rows with the same Order_ID belong to the same purchase.

I did this exercise in Python but I can't do it in R. My Python code was:

pares_compras[['Order ID', 'Product']]

>  Order ID Product
2   176560  Google Phone
3   176560  Wired Headphones
17  176574  Google Phone
18  176574  USB-C Charging Cable
29  176585  Bose SoundSport Headphones

pares_compras['Grouped Products'] = pares_compras.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))

>
2                             Google Phone,Wired Headphones
3                             Google Phone,Wired Headphones
17                        Google Phone,USB-C Charging Cable
18                        Google Phone,USB-C Charging Cable

pares_compras = pares_compras[['Order ID', 'Grouped Products']].drop_duplicates()
pares_compras

>   Order ID    Grouped Products
2   176560  Google Phone,Wired Headphones
17  176574  Google Phone,USB-C Charging Cable
29  176585  Bose SoundSport Headphones,Bose SoundSport Hea...
31  176586  AAA Batteries (4-pack),Google Phone
118 176672  Lightning Charging Cable,USB-C Charging Cable

from collections import Counter
from itertools import combinations

count = Counter()

for row in pares_compras['Grouped Products']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))
count

> Counter({('Google Phone', 'Wired Headphones'): 414,
         ('Google Phone', 'USB-C Charging Cable'): 987,
         ('Bose SoundSport Headphones', 'Bose SoundSport Headphones'): 27, ...})

for key, num in count.most_common(5):
    print(key, num)
>
('iPhone', 'Lightning Charging Cable') 1005
('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361

So, with that I could solve the exercise, but, like I said before, I can't do the same in R; I can't find the way, as I've just started with R. If anyone can help me I would really appreciate it, thanks.

Walras
  • I don't think dplyr is necessarily the way to go here. In fact, from the tidyverse I would be more likely to use tidyr to create columns with each item and rows for each purchase, then do something like calculate the correlations. But even if you just want to get the joint frequencies, I think that is the way to go. – Elin Aug 01 '22 at 23:53
  • @Elin yeah, maybe I can put each purchase in one row, in different columns. I thought of that, but I can't imagine how to count the pair frequencies then. – Walras Aug 02 '22 at 00:04
  • @JahiZamy yes! That is what I was looking for... but... I don't know why, I don't get the same numbers I got in Python... – Walras Aug 02 '22 at 00:38
  • Walras are you only looking for the combinations of exactly two products? and how often they are bought together? –  Aug 02 '22 at 00:54
  • @ZheyuanLi My code could be wrong, yes. I will check it out later. In my Python code I look at all the combinations that can be made with 2 products and then I count the occurrences of each possible combination. I only use 2 products because they are the most common compared to those of 3, 4 or 5. – Walras Aug 02 '22 at 00:57
  • @JahiZamy Yes! Only two... I know it's not realistic, but for now it's enough for me. – Walras Aug 02 '22 at 00:59
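The pairwise counting discussed in these comments can be sketched directly in R with dplyr and base `combn()`. This is a sketch, not from any answer below; it assumes a data frame `df` with the `Order_ID` and `Product` columns from the question (the sample data is the question's excerpt):

```r
library(dplyr)

df <- data.frame(
  Order_ID = c(319631, 319631, 319596, 319596,
               319584, 319584, 319556, 319556),
  Product  = c("34in Ultrawide Monitor", "Lightning Charging Cable",
               "iPhone", "Lightning Charging Cable",
               "iPhone", "Wired Headphones",
               "Google Phone", "Wired Headphones")
)

pair_counts <- df |>
  group_by(Order_ID) |>
  filter(n() >= 2) |>    # combn() needs at least 2 items per order
  summarise(
    # all 2-product combinations within an order,
    # like itertools.combinations in the Python version
    pair = combn(sort(Product), 2, paste, collapse = " + "),
    .groups = "drop"
  ) |>
  count(pair, sort = TRUE)

pair_counts
```

Note that, unlike the Python code, this sorts the products within each order before forming pairs, so ('A', 'B') and ('B', 'A') are counted as the same pair; that difference alone can explain small discrepancies with the Python counts.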

3 Answers


Hmm, I think that a co-occurrence matrix would actually be a good solution.

Another approach is to think about how different or similar the product profiles are.

orders <- read.csv(header = TRUE, text ='
"row", "order", "product"
1,   319631, "34in Ultrawide Monitor"
2,   319631, "Lightning Charging Cable"
3,   319596, "iPhone"
4,   319596, "Lightning Charging Cable"
5,   319584, "iPhone"
6,   319584, "Wired Headphones"
7,   319556, "Google Phone"
8,   319556, "Wired Headphones"')  |>
  dplyr::mutate(product = trimws(product))

df <- tidyr::pivot_wider(orders,
                   values_from = product, 
                   names_from = product, 
                   id_cols = order) |>
  dplyr::mutate(across( `34in Ultrawide Monitor`:`Google Phone` ,
                        ~!is.na(.x))) |>
  dplyr::select(-order) 
cor(df)

dist(t(df))
dist(t(df), method = "binary")
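If the goal is the raw joint frequencies rather than correlations or distances, `crossprod()` on the same wide logical table gives a product-by-product co-occurrence count matrix. A sketch, reusing the `df` built above:

```r
# TRUE/FALSE columns are treated as 1/0, so each off-diagonal entry
# counts the orders containing both products; the diagonal counts
# the orders containing each single product
co <- crossprod(as.matrix(df))
co
```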




Elin

I'll just leave this here as an alternative for you.

Here I list the unique combinations in two data frames and check for sameness using nested apply functions, counting the results with `rowSums()` after `cbind()`ing.

library(dplyr)

a <- expand.grid(a = df$Product, b = df$Product) |>
  rowwise() |> 
  mutate(c = list(sort(c(a, b))), a = c[[1]], b = c[[2]]) |> 
  distinct() |> 
  filter(a != b)

b <- df |> 
  group_by(Order_ID) |> 
  summarise(Product = list(c(Product)))

a$count <- rowSums(do.call(cbind, 
  lapply(b$Product, \(one) sapply(a$c, \(two) +(all(two %in% one))))))

   a                      b                      count
   <chr>                  <chr>                  <dbl>
 1 34inUltrawideMonitor   LightningChargingCable     1
 2 34inUltrawideMonitor   iPhone                     0
 3 34inUltrawideMonitor   WiredHeadphones            0
 4 34inUltrawideMonitor   GooglePhone                0
 5 iPhone                 LightningChargingCable     1
 6 LightningChargingCable WiredHeadphones            0
 7 GooglePhone            LightningChargingCable     0
 8 iPhone                 WiredHeadphones            1
 9 GooglePhone            iPhone                     0
10 GooglePhone            WiredHeadphones            1


A solution using a data.table join instead of a co-occurrence matrix. With a larger dataset (~3M rows), it is almost twice as fast on my machine as the `crossprod(xtabs(...))` approach from this answer.

library(data.table)
library(Matrix) # for comparison with a co-occurrence matrix solution

# Example dataset
n <- 1e6L
orderID <- rep.int(sample.int(n), rpois(n, 1) + 2L)
dt <- unique(data.table(orderID, product = stringi::stri_rand_strings(length(orderID), 2, pattern = "[a-z]")))

# solution using a data.table join
f1 <- function(dt) {
  dt2 <- dt[
    , x := .I
  ][
    dt,
    on = .(orderID = orderID, x > x),
    nomatch = 0
  ][
    product > i.product, c("product", "i.product") := list(i.product, product)
  ][
    , .(count = .N), .(product, i.product)
  ]
  dt[, x := NULL]
  setnames(dt2, c("product1", "product2", "count"))
  setorder(dt2, -count, product1, product2)
}

# co-occurrence matrix solution (slightly modified so the output of the two
# functions is the same)
f2 <- function(dt) {
  dt$product <- as.factor(dt$product)
  dt4 <- setDT(
    summary(
      crossprod(
        xtabs(~ orderID + product, dt, sparse = TRUE)
      )
    )
  )[
    i < j
  ][
    , `:=`(
      i = as.character(levels(dt$product)[i]),
      j = as.character(levels(dt$product)[j]),
      x = as.integer(x)
    )
  ]
  dt[, product := as.character(product)]
  attr(dt4, "header") <- NULL
  setnames(dt4, c("product1", "product2", "count"))
  setorder(dt4, -count, product1, product2)
}

Benchmarking:
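The call that produced these timings is not shown in the post; it was presumably something along these lines (a hypothetical sketch, reusing `dt`, `f1`, and `f2` from above; `copy()` guards against the by-reference modifications both functions make to `dt`):

```r
library(microbenchmark)

microbenchmark(
  f1 = f1(copy(dt)),
  f2 = f2(copy(dt)),
  times = 10
)
```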
    
#> Unit: seconds
#>  expr      min       lq     mean   median       uq      max neval
#>    f1 1.026762 1.128890 1.224291 1.261732 1.278617 1.362014    10
#>    f2 1.984068 2.295159 2.355434 2.403337 2.465651 2.589428    10
jblood94