How to calculate common values across different groups?

Question

I am trying to create a data frame for creating network charts using igraph package. I have sample data "mydata_data" and I want to create "expected_data".

I can easily calculate number of customers visited a particular store, but how do I calculate common set of customers who go to store x1 & store x2 etc.

I have 500+ stores, so I don't want to create columns manually. Sample data for reproducible purpose given below:

mydata_data<-data.frame(
  Customer_Name=c("A","A","C","D","D","B"),
  Store_Name=c("x1","x2","x2","x2","x3","x1"))

expected_data<-data.frame(
 Store_Name=c("x1","x2","x3","x1_x2","x2_x3","x1_x3"), 
 Customers_Visited=c(2,3,1,1,1,0))

should there also be a row with `x1_x2_x3` or you want combinations of only 2 `Store_Name` at a time ? — Ronak Shah, Dec 13 '18 at 06:24

Sotos · Accepted Answer · 2018-12-13T10:02:28.783

Another possible solution via dplyr is to create a list with all the combos for each customer, unnest that list, count and merge with a data frame with all the combinations, i.e.

library(tidyverse)

df %>%
    group_by(Customer_Name) %>%
    summarise(combos = list(unique(c(unique(Store_Name), paste(unique(Store_Name), collapse = '_'))))) %>%
    unnest() %>%
    group_by(combos) %>%
    count() %>%
    right_join(data.frame(combos = c(unique(df$Store_Name), combn(unique(df$Store_Name), 2, paste, collapse = '_'))))

which gives,

# A tibble: 6 x 2
# Groups:   combos [?]
  combos     n
  <chr>  <int>
1 x1         2
2 x2         3
3 x3         1
4 x1_x2      1
5 x1_x3     NA
6 x2_x3      1

NOTE: Make sure that your Store_Name variable is a character NOT factor, otherwise the combn() will fail

Julius Vainora · Answer 2 · 2018-12-13T11:19:08.070

Here's an igraph approach:

A <- as.matrix(as_adj(graph_from_edgelist(as.matrix(mydata_data), directed = FALSE)))
stores <- as.character(unique(mydata_data$Store_Name))
storeCombs <- t(combn(stores, 2))

data.frame(Store_Name = c(stores, apply(storeCombs, 1, paste, collapse = "_")),
           Customers_Visited = c(colSums(A)[stores], (A %*% A)[storeCombs]))
#   Store_Name Customers_Visited
# 1         x1                 2
# 2         x2                 3
# 3         x3                 1
# 4      x1_x2                 1
# 5      x1_x3                 0
# 6      x2_x3                 1

Explanation: A is the adjacency matrix of the corresponding undirected graph. stores is simply

stores
# [1] "x1" "x2" "x3"

while

storeCombs
#      [,1] [,2]
# [1,] "x1" "x2"
# [2,] "x1" "x3"
# [3,] "x2" "x3"

The main trick then is how to obtain Customers_Visited: the first three numbers are just the corresponding numbers of neighbours of stores, while the common customers we get from the common graph neighbours (which we get from the square of A).

score 1 · Answer 3 · answered Dec 13 '18 at 06:54

Here's one possible way to get the data

Here's a helper function adapted form here: Generate all combinations, of all lengths, in R, from a vector

comball <- function(x) do.call("c", lapply(seq_along(x), function(i) combn(as.character(x), i, FUN = list)))

Then you can use that with some tidy verse functions

library(dplyr)
library(purrr)
library(tidyr)

mydata_data %>% 
  group_by(Customer_Name) %>% 
  summarize(visits = list(comball(Store_Name))) %>% 
  mutate(visits = map(visits, ~map_chr(., ~paste(., collapse="_")))) %>% 
  unnest(visits) %>% 
  count(visits)

score 1 · Answer 4 · answered Dec 13 '18 at 09:26

Another option, with base R:

Get the list of all possible stores

all_stores <- as.character(unique(mydata_data$Store_Name))

Find the different combinations of 1 or 2 stores :

all_comb_store <- lapply(1:2, function(n) combn(all_stores, n))

For each number of stores combined, get the number of customers that visited both and then combined this value in a data.frame with the names of the stores:

do.call(rbind, 
        lapply(all_comb_store, 
               function(nb_comb) {
                 data.frame(Store_Name=if (nrow(nb_comb)==1) as.character(nb_comb) else apply(nb_comb, 2, paste, collapse="_"), 
                            Customers_Visited=apply(nb_comb, 2, 
                                                    function(vec_stores) {
                                                       length(Reduce(intersect, 
                                                              lapply(vec_stores, 
                                                                     function(store) mydata_data$Customer_Name[mydata_data$Store_Name %in% store])))}))}))
#  Store_Name Customers_Visited
#1         x1                 2
#2         x2                 3
#3         x3                 1
#4      x1_x2                 1
#5      x1_x3                 0
#6      x2_x3                 1

score 1 · Answer 5 · answered Dec 13 '18 at 12:24

Using dplyr: self join, then make group and get unique count. This should be a lot quicker compared to other answers where all combinations are considered.

Note: it doesn't show non-existent pairs. Also, here x1_x1 means, of course, x1.

left_join(mydata_data, mydata_data, by = "Customer_Name")  %>%
  transmute(Customer_Name,
            grp = paste(pmin(Store_Name.x, Store_Name.y),
                        pmax(Store_Name.x, Store_Name.y), sep = "_")) %>% 
  group_by(grp) %>% 
  summarise(n = n_distinct(Customer_Name))

# # A tibble: 5 x 2
#   grp       n
#   <chr> <int>
# 1 x1_x1     2
# 2 x1_x2     1
# 3 x2_x2     3
# 4 x2_x3     1
# 5 x3_x3     1

Data without factors:

mydata_data<-data.frame(
  Customer_Name=c("A","A","C","D","D","B"),
  Store_Name=c("x1","x2","x2","x2","x3","x1"),
  stringsAsFactors = FALSE)

How to calculate common values across different groups?

5 Answers5