6

I am trying to create a data frame for creating network charts using igraph package. I have sample data "mydata_data" and I want to create "expected_data".

I can easily calculate number of customers visited a particular store, but how do I calculate common set of customers who go to store x1 & store x2 etc.

I have 500+ stores, so I don't want to create columns manually. Sample data for reproducible purpose given below:

mydata_data<-data.frame(
  Customer_Name=c("A","A","C","D","D","B"),
  Store_Name=c("x1","x2","x2","x2","x3","x1"))

expected_data<-data.frame(
 Store_Name=c("x1","x2","x3","x1_x2","x2_x3","x1_x3"), 
 Customers_Visited=c(2,3,1,1,1,0))
zx8754
  • 52,746
  • 12
  • 114
  • 209
Yogesh Kumar
  • 609
  • 6
  • 22

5 Answers5

2

Another possible solution via dplyr is to create a list with all the combos for each customer, unnest that list, count and merge with a data frame with all the combinations, i.e.

library(tidyverse)

df %>%
    group_by(Customer_Name) %>%
    summarise(combos = list(unique(c(unique(Store_Name), paste(unique(Store_Name), collapse = '_'))))) %>%
    unnest() %>%
    group_by(combos) %>%
    count() %>%
    right_join(data.frame(combos = c(unique(df$Store_Name), combn(unique(df$Store_Name), 2, paste, collapse = '_'))))

which gives,

# A tibble: 6 x 2
# Groups:   combos [?]
  combos     n
  <chr>  <int>
1 x1         2
2 x2         3
3 x3         1
4 x1_x2      1
5 x1_x3     NA
6 x2_x3      1

NOTE: Make sure that your Store_Name variable is a character NOT factor, otherwise the combn() will fail

Sotos
  • 51,121
  • 6
  • 32
  • 66
2

Here's an igraph approach:

A <- as.matrix(as_adj(graph_from_edgelist(as.matrix(mydata_data), directed = FALSE)))
stores <- as.character(unique(mydata_data$Store_Name))
storeCombs <- t(combn(stores, 2))

data.frame(Store_Name = c(stores, apply(storeCombs, 1, paste, collapse = "_")),
           Customers_Visited = c(colSums(A)[stores], (A %*% A)[storeCombs]))
#   Store_Name Customers_Visited
# 1         x1                 2
# 2         x2                 3
# 3         x3                 1
# 4      x1_x2                 1
# 5      x1_x3                 0
# 6      x2_x3                 1

Explanation: A is the adjacency matrix of the corresponding undirected graph. stores is simply

stores
# [1] "x1" "x2" "x3"

while

storeCombs
#      [,1] [,2]
# [1,] "x1" "x2"
# [2,] "x1" "x3"
# [3,] "x2" "x3"

The main trick then is how to obtain Customers_Visited: the first three numbers are just the corresponding numbers of neighbours of stores, while the common customers we get from the common graph neighbours (which we get from the square of A).

Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
1

Here's one possible way to get the data

Here's a helper function adapted form here: Generate all combinations, of all lengths, in R, from a vector

comball <- function(x) do.call("c", lapply(seq_along(x), function(i) combn(as.character(x), i, FUN = list)))

Then you can use that with some tidy verse functions

library(dplyr)
library(purrr)
library(tidyr)

mydata_data %>% 
  group_by(Customer_Name) %>% 
  summarize(visits = list(comball(Store_Name))) %>% 
  mutate(visits = map(visits, ~map_chr(., ~paste(., collapse="_")))) %>% 
  unnest(visits) %>% 
  count(visits)
MrFlick
  • 195,160
  • 17
  • 277
  • 295
1

Another option, with base R:

Get the list of all possible stores

all_stores <- as.character(unique(mydata_data$Store_Name))

Find the different combinations of 1 or 2 stores :

all_comb_store <- lapply(1:2, function(n) combn(all_stores, n))

For each number of stores combined, get the number of customers that visited both and then combined this value in a data.frame with the names of the stores:

do.call(rbind, 
        lapply(all_comb_store, 
               function(nb_comb) {
                 data.frame(Store_Name=if (nrow(nb_comb)==1) as.character(nb_comb) else apply(nb_comb, 2, paste, collapse="_"), 
                            Customers_Visited=apply(nb_comb, 2, 
                                                    function(vec_stores) {
                                                       length(Reduce(intersect, 
                                                              lapply(vec_stores, 
                                                                     function(store) mydata_data$Customer_Name[mydata_data$Store_Name %in% store])))}))}))
#  Store_Name Customers_Visited
#1         x1                 2
#2         x2                 3
#3         x3                 1
#4      x1_x2                 1
#5      x1_x3                 0
#6      x2_x3                 1
Cath
  • 23,906
  • 5
  • 52
  • 86
1

Using dplyr: self join, then make group and get unique count. This should be a lot quicker compared to other answers where all combinations are considered.

Note: it doesn't show non-existent pairs. Also, here x1_x1 means, of course, x1.

left_join(mydata_data, mydata_data, by = "Customer_Name")  %>%
  transmute(Customer_Name,
            grp = paste(pmin(Store_Name.x, Store_Name.y),
                        pmax(Store_Name.x, Store_Name.y), sep = "_")) %>% 
  group_by(grp) %>% 
  summarise(n = n_distinct(Customer_Name))

# # A tibble: 5 x 2
#   grp       n
#   <chr> <int>
# 1 x1_x1     2
# 2 x1_x2     1
# 3 x2_x2     3
# 4 x2_x3     1
# 5 x3_x3     1

Data without factors:

mydata_data<-data.frame(
  Customer_Name=c("A","A","C","D","D","B"),
  Store_Name=c("x1","x2","x2","x2","x3","x1"),
  stringsAsFactors = FALSE)
zx8754
  • 52,746
  • 12
  • 114
  • 209