How can I summarize this data with R?

Question

I am analyzing the flow of customers between different shopping venues. I have data like this:

df <- data.frame(customer.id=letters[seq(1,7)], 
                 shop.1=c(1,1,1,1,1,0,0),
                 shop.2=c(0,0,1,1,1,1,0),
                 shop.3=c(1,0,0,0,0,0,1))
df

#>   customer.id shop.1 shop.2 shop.3
#> 1           a      1      0      1
#> 2           b      1      0      0  
#> 3           c      1      1      0 
#> 4           d      1      1      0 
#> 5           e      1      1      0 
#> 6           f      0      1      0 
#> 7           g      0      0      1

So, for example:

customer "a" shopped at shops 1 & 3 only,
customer "b" shopped at shop 1 only,
customer "c" shopped at shops 1 & 2 only,
etc.

I want to summarize the data like so:

#>           shop.1 shop.2 shop.3 
#> shop.1         5      3      1
#> shop.2         3      4      0       
#> shop.3         1      0      2

So, for example, row 1 reads:

5 people shopped at both shop 1 and shop 1 (this is obviously a redundant observation)
3 people shopped at both shop 1 and shop 2
1 person shopped at both shop 1 and shop 3

How can I accomplish this (please note: I have many shops in my data set, so a scalable approach is preferred)?

Thanks, I've changed the example to show numeric data. But how do I get the summary that I'm looking for? — Rez99, Mar 05 '19 at 02:06
Dplyr would be great! Base R is also good - thank you for asking! — Rez99, Mar 05 '19 at 02:09
Your starting dataframe (once it's numeric) is conceptually similar to an edgelist/adjacency matrix, and you want to crosstabulate customer.id's by {combination of shops where they shopped}. But I can't remember how to find that duplicate question... it's pretty resistant to keyword-search, hang on... — smci, Mar 05 '19 at 02:25
Related: [**How do I calculate the co-occurrence in the table?**](https://stackoverflow.com/questions/19977596/how-do-i-calculate-the-co-occurrence-in-the-table) — smci, Mar 05 '19 at 09:43

score 4 · Accepted Answer · answered Mar 05 '19 at 03:37

4

crossprod can take care of what you want to do, after a bit of basic manipulation to get it into 2 columns representing customer and shop respectively:

tmp <- cbind(df[1],stack(df[-1]))
tmp <- tmp[tmp$values==1,]

crossprod(table(tmp[c(1,3)]))

#        ind
#ind      shop.1 shop.2 shop.3
#  shop.1      5      3      1
#  shop.2      3      4      0
#  shop.3      1      0      2

answered Mar 05 '19 at 03:37

thelatemail

91,185
12
128
188

It can be done with `ftable(xtabs(...))`, can you figure out how? – smci Mar 05 '19 at 03:43
@smci - can't say I've ever used that approach. I'd be interested to see how to do it though. – thelatemail Mar 05 '19 at 03:49
Your first line is just a manual `reshape2::melt(df, id.vars='customer.id')` – smci Mar 05 '19 at 03:56
In fact your first two lines are the equivalent of `reshape2::melt(df, id.vars='customer.id') %>% dplyr::filter(value==1)` – smci Mar 05 '19 at 03:57
@smci - absolutely, or `gather` from *tidyr* or other similar alternatives. `df %>% gather(shop,value,-customer.id) %>% filter(value==1)` for instance. – thelatemail Mar 05 '19 at 04:18
and a trailing `%>% select(-value)` to drop that column after we filtered. But supposedly `ftable(xtabs(...))` will do what we want directly, without needing these transforms. I give up... maybe someone else will show us how. – smci Mar 05 '19 at 04:21

smci · Answer 2 · 2019-03-06T22:57:32.373

You want to tabulate the co-occurrence of shop.* variables:

df[,2:4] <- sapply(df[,2:4], function(x) { ifelse(x=="", 0, 1) } )

1) It can supposedly be done using ftable(xtabs(...)), but I struggled with that for ages and couldn't get it. The closest I got is:

> ftable(xtabs(~ shop.1 + shop.2 + shop.3, df))

              shop.3 0 1
shop.1 shop.2           
0      0             0 1
       1             1 0
1      0             1 1
       1             3 0

2) As @thelatemail showed, you could also:

# Transform your df from wide-form to long-form...
library(dplyr)
library(reshape2)
occurrence_df <- reshape2::melt(df, id.vars='customer.id') %>%
                 dplyr::filter(value==1)

   customer.id variable value
1            a   shop.1     1
2            b   shop.1     1
3            c   shop.1     1
4            d   shop.1     1
5            e   shop.1     1
6            c   shop.2     1
7            d   shop.2     1
8            e   shop.2     1
9            f   shop.2     1
10           a   shop.3     1
11           g   shop.3     1

Really we can drop value column after the filter, so we could pipe %>% select(-value)

   customer.id variable
1            a   shop.1
2            b   shop.1
3            c   shop.1
4            d   shop.1
5            e   shop.1
6            c   shop.2
7            d   shop.2
8            e   shop.2
9            f   shop.2
10           a   shop.3
11           g   shop.3

# then same crossprod step as @thelatemail's answer:

crossprod(table(occurrence_df))

        variable
variable shop.1 shop.2 shop.3
  shop.1      5      3      1
  shop.2      3      4      0
  shop.3      1      0      2

(Footnotes:

First your data should be numeric (or factor), not string. You want to convert "x" to 1 and "" to 0.
If they are strings because they came from read.csv, use read.csv arguments stringsAsFactors=TRUE to make them factor, or colClasses to make them numeric, and see all the many duplicate questions on that.)

score 2 · Answer 3 · answered Mar 05 '19 at 04:18

In fact, matrix operation seems enough because the data frame only has 0 and 1.

First, exclude the customer.id column and change the data.frame to matrix. This might be easy. (mydf is the name of your data frame.)

# base R way
as.matrix(mydf[,-1])
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

library(dplyr) #dplyr way
(mymat <-
  mydf %>% 
  select(-customer.id) %>% 
  as.matrix())
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

With this matrix, just do the matrix operation as below.

t(mymat) %*% mymat
#>        shop.1 shop.2 shop.3
#> shop.1      5      3      1
#> shop.2      3      4      0
#> shop.3      1      0      2

You can get your answer.

How can I summarize this data with R?

3 Answers3