1

I am planning to run apriori algorithm on a dataset to find association among product purchase. The dataset looks like this

order_id <- c('AG-2011-2040','IN-2011-47883', 'HU-2011-1220','IT-2011-3647632','IN-2011-47883','IN-2011-47883','IN-2011-30733','CA-2011-115161','AO-2011-1390','ID-2011-56493'  )
    
product_name <- c('Tenex Lockers, Blue','Acme Trimmer, High Speed', 'Tenex Box, Single Width', 'Enermax Note Cards, Premium','Eldon Light Bulb, Duo Pack','Eaton Computer Printout Paper, 8.5 x 11', 'Brother Personal Copier, Laser','Sauder Facets Collection Library, Sky Alder Finish', 'Fellowes Lockers, Wire Frame','Tenex Trays, Single Width')
        
df <- data.frame(order_id, product_name)
df 

In the image, you can see that on the left there is Order.ID and in the right column, there's product name. I want to concatenate all the products that have the same id in a single row. In other words, I want to aggregate all the products based on Order.ID. Since products are not numerical, it is hard to do so.

Any idea on how to approach this problem would be appreciated. Alternatively, if you have a different idea on how to prepare this data to run apriori algorithm, that would be great too!

Thanks in advance!

  • 1
    Welcome to SO! Please dont add data as images, it makes it very challenging to reproduce. Instead, edit your question to include data as reproducible code. See some tips on how to do that [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – jpsmith Mar 24 '23 at 20:42
  • 1
    Thank you @jpsmith! I have replaced the data with code. I hope this helps. Let me know if I can provide anything else. – Timothy Roy Mar 24 '23 at 20:54

2 Answers2

0

You can concatenate all the products using the same order ID using the dplyr package's summarize (with the .by function for order id) and base R paste. Here, I chose to separate by a comma, but could change to anything else you'd want:

library(dplyr)
df %>% 
  summarise(ProductName = 
              paste(ProductName, collapse = ", "), 
            .by = OrderID)

#  OrderID      ProductName
#1    EX-1       B, E, J, W
#2    EX-2             D, I
#3    EX-3             S, Y
#4    EX-4             K, V
#5    EX-5                C
#6    EX-6    F, G, H, N, O
#7    EX-7             L, T
#8    EX-8 A, M, P, Q, R, U
#9    EX-9                X

# or for older versions of dplyr:
df %>% 
  group_by(OrderID) %>%
  summarise(ProductName = 
              paste(ProductName, collapse=", "))

Data:

df <- data.frame(OrderID = paste0("EX-", sample(1:9, 25, replace = TRUE)),
                 ProductName = LETTERS[1:25])
df <- df[order(df$OrderID),]
jpsmith
  • 11,023
  • 5
  • 15
  • 36
0

Split-map based solution (probably among the fastest solution for large datasets):

library(magrittr)
aggregater <- function(tibble) {
  tibble::tibble(
    order_id = tibble$order_id[1],
    product_name = stringr::str_flatten(tibble$product_name, '; ')
  )
}


order_id <- c('AG-2011-2040','IN-2011-47883', 'HU-2011-1220','IT-2011-3647632','IN-2011-47883','IN-2011-47883','IN-2011-30733','CA-2011-115161','AO-2011-1390','ID-2011-56493'  )
    
product_name <- c('Tenex Lockers, Blue','Acme Trimmer, High Speed', 'Tenex Box, Single Width', 'Enermax Note Cards, Premium','Eldon Light Bulb, Duo Pack','Eaton Computer Printout Paper, 8.5 x 11', 'Brother Personal Copier, Laser','Sauder Facets Collection Library, Sky Alder Finish', 'Fellowes Lockers, Wire Frame','Tenex Trays, Single Width')
        
df <- data.frame(order_id, product_name)

df %>%
  split.data.frame(.$order_id) %>%
  purrr::map(aggregater) %>%
  dplyr::bind_rows()


# A tibble: 8 x 2
  order_id       product_name                                                          
  <chr>          <chr>                                                          
1 AG-2011-2040   Tenex Lockers, Blue                                            
2 AO-2011-1390   Fellowes Lockers, Wire Frame                                   
3 CA-2011-115161 Sauder Facets Collection Library, Sky Alder Finish             
4 HU-2011-1220   Tenex Box, Single Width                                        
5 ID-2011-56493  Tenex Trays, Single Width                                      
6 IN-2011-30733  Brother Personal Copier, Laser                                 
7 IN-2011-47883  Acme Trimmer, High Speed; Eldon Light Bulb, Duo Pack; Eaton Co~
8 IT-2011-36476~ Enermax Note Cards, Premium 
Baraliuh
  • 2,009
  • 5
  • 11