How to Transform Data to Find Index with Same Value

Question

I intend to find customers who have bought exactly the same products, This is the output that I prefer. I have also provided the dput code below.

The data I have is customers' behaviors--what they have bought. This will be my input (only two columns)

The example that I provided is a simplified version of my data. Customers will usually buy 10 to 20 products. There are around 50 products that consumers could choose to buy.

I am really confused what is an easy way to transform my data into the output that I prefer. Could you please give me any advice? Thanks

Input:

structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper", 
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper", 
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID", 
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer", 
"collector")), Products = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Output:

structure(list(`Products Bought` = c("Apple, Beer, Diaper", "Apple, Diaper"
), Customer_ID = c("1, 3, 6", "4, 5")), .Names = c("Products Bought", 
"Customer_ID"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-2L), spec = structure(list(cols = structure(list(`Products Bought` = structure(list(), class = c("collector_character", 
"collector")), Customer_ID = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Products Bought", "Customer_ID")), 
    default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"))

Please show a small reproducible using `dput` and the expected output based on that instead of images — akrun, May 06 '17 at 03:24

score 1 · Accepted Answer · edited May 23 '17 at 12:02

I am suspicious that you may want to look at structuring your data in a way that is more usable. In any case, the tidyverse can be a helpful way of thinking through your task.

As mentioned, posting code for others to start with can save them time and get you an answer faster.

library(dplyr)
library(stringr)
library(tidyr)

d <- data_frame(id=c(1,2,3,4,5,6)
     , bought=c('Apple, Beer, Diaper','Apple, Beer', 'Apple, Beer, Diaper, Diaper'
               , 'Apple, Diaper', 'Diaper, Apple', 'Apple, Diaper, Beer, Beer'))

d %>% 
## Unnest the values & take care of white space
## - This is the better data structure to have, anyways
mutate(buy=str_split(bought,',')) %>% 
unnest(buy) %>% mutate(buy=str_trim(buy)) %>% select(-bought) %>%

## Get distinct (and sort?)
distinct(id, buy) %>% arrange(id, buy) %>%

## Aggregate by id
group_by(id) %>% summarize(bought=paste(buy,collapse=', ')) %>% ungroup %>%

## Count
group_by(bought) %>% summarize(ids=paste(id,collapse=',')) %>% ungroup

EDIT: referencing this SO post for getting distinct combinations faster / cleaner in dplyr

score 0 · Answer 2 · answered May 08 '17 at 05:47

Using the given input data and data.table, this can be written as (rather convoluted) "one-liner":

dcast(unique(setDT(input)[, strsplit(Products, ", "), Customer_ID])[
  order(Customer_ID, V1)], 
  Customer_ID ~ ., paste, collapse = ", ")[
    , .(Customers = paste(Customer_ID, collapse = ", ")), .(Products = .)]
#              Products Customers
#1: Apple, Beer, Diaper   1, 3, 6
#2:         Apple, Beer         2
#3:       Apple, Diaper      4, 5

Note that the OP has dropped the second line with only one customer from the expected output but hasn't mentioned any criteria for filtering the output in the question.

Input data

(As given by OP):

input <- structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper", 
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper", 
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID", 
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer", 
"collector")), Products = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

How to Transform Data to Find Index with Same Value

2 Answers2

Input data