2

I have a set of data by store by week with 2 features - sales and pp. I want to cluster them based on these features ideally to group the ones with the most similar trading patterns throughout the year.

Is it possible to do this with the data I have. My understanding is the features are the columns and the rows are what will be assigned labels based on the clusters, but I have weeks to take into account so I don't know if these should be in the rows or the columns?

library(data.table)
dt <- data.table(weeks = rep(seq(1:5),5),
             store = c(rep("a", 5), rep("b", 5), rep("c", 5), 
                       rep("e", 5), rep("d", 5)),
             sales = rep(rnorm(5), 5),
             pp = rep(rnorm(5), 5))
dt <- dcast.data.table(dt, store ~ weeks, value.var = c("sales", "pp"))

thanks

MidnightDataGeek
  • 938
  • 12
  • 21
  • Do you think that the clustering of stores with respect to (sales,pp) to vary from week to week. – Silence Dogood Aug 23 '16 at 09:55
  • Yes, there are over 1,000 stores in the data and some of them will be in high tourist locations so there will be spikes in the data for them stores when it's really hot etc. So I was expecting say within the stores where this is evident, there might be a further subset based on the pp metric. Usually I've just taken the average weekly sales/pp and used that so removing the individual weeks but I am hoping to be able to capture the variation in the weeks throughout the year too. – MidnightDataGeek Aug 23 '16 at 10:11

1 Answers1

1

Since you have more than 1000 stores, the below demo might not be applicable directly but hopefully point you in the right direction.

You can analyze clustering of stores for a weekly basis or any another variant of weeks(N=4,8,...)

Here we look at store clusters at weekly frequency:

Test Data

library(dplyr)     #data manipulation
library(ggdendro)  #extracting clusters
library(ggplot2)   #plotting
library(gridExtra) #for arranging ggplot graphs in grids 

set.seed(42)

DF <- data.frame(weeks = rep(seq(1:5),5),
    store = c(rep("A", 5), rep("B", 5), rep("C", 5), 
            rep("E", 5), rep("D", 5)),
    sales = rnorm(25),
    pp = rnorm(25))


weekInput = unique(DF$weeks)

Choosing frequency of weeks:

In the simple case, I have included discrete weeks i.e. 1,2,3,4 and 5.

Suppose you wish to choose N = 4 as frequency of weeks i.e. weeks 1-to-4,5-to-8 etc, you can use following to create partitions of weekInput over the year and modify as per your requirements.

#weekFreq = 4
#StartWeek = 1
#EndWeek = 52
#startPoints=seq(StartWeek,EndWeek,weekFreq)    
#endPoints= c(tail(startPoints,-1)-1,EndWeek)
#
#freqDF = data.frame(cbind(startPoints,endPoints))
#weekInput = lapply(1:nrow(freqDF),function(x) { z= freqDF[x,]; z=as.vector(as.matrix(z)) } )
#head(weekInput)
#[[1]]
#[1] 1 4
#
#[[2]]
#[1] 5 8
#
#[[3]]
#[1]  9 12
#
#[[4]]
#[1] 13 16
#
#[[5]]
#[1] 17 20
#
#[[6]]
#[1] 21 24

Plotting of clusters

Good resources for plotting various dendrograms are here and here

For each week, we calculate distance matrix over numeric data and create hierarchical clusters and use ggdendro package for plotting and output a list of plot objects

plotList = lapply(weekInput,function(x) {

subsetWeek=DF %>% 
  group_by(weeks) %>% 
  filter(weeks==x) %>%  #you could change this to `weeks %in% c(x[1],x[2])`
  as.data.frame()  %>%  # x[1] and x[2] are start and end points of weekInput
  select(-weeks) %>% 
  as.data.frame()

#For numeric features of data, compute the distance matrix and form hierarchical cluster

numericDF= subsetWeek[,sapply(subsetWeek,is.numeric)]

clustDF = hclust(dist(numericDF))

#You can choose to limit the clusters to N = n, as per your discretion
#clustDF =  cutree(clustDF, 4)


clustDF$labels = subsetWeek$store

#Use functions from ggdendro package for extracting clusters for ease in plotting

clustDendro = as.dendrogram(clustDF)

dendroData = dendro_data(clustDendro,type="rectangle")

Labels = label(dendroData)
Labels$group <- c(rep("Area1", 2), rep("Area2", 2), rep("Area3", 1))


gPlot = ggplot(segment(dendroData)) +
    geom_segment(aes(x=x,y=y,xend=xend,yend=yend)) + 
    geom_label(data=Labels,aes(label=label,x=x,y=0,label.size=5,colour=Labels$group,fontface="bold")) +
    ggtitle(paste0("Store Clusters for Week:",x)) +
    labs(color="Area Names\n")


gPlot = gPlot + theme(legend.title = element_text(face = "bold"))

return(gPlot)


})

Arranging Plots

The list of plot objects from above can be arranged as per requirement. Useful link on more details are here

grid::grid.newpage()
grid::grid.draw(do.call(rbind,lapply(plotList,function(x) ggplotGrob(x))))

Single Week: enter image description here

All Weeks

Due to single column arrangement, readability is slightly affected enter image description here

Community
  • 1
  • 1
Silence Dogood
  • 3,587
  • 1
  • 13
  • 17