Which clustering distance-metric to find the most correlated groups of items

Question

I have the restaurant sales data below and want to find the restaurants that are most similar to each other. I'm looking for a kind of clustering based on how closely they are correlated, where "correlation" means "the most closely matching restaurants on the combination of Units Sold, Revenue & Footfall". (Note: this is a follow-up question to corelatedItems)

+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+----------+
| Loc - 01 |        100 | 1,150   |       85 |
| Loc - 02 |        100 | 1,250   |       60 |
| Loc - 03 |         90 | 990     |       90 |
| Loc - 04 |        120 | 1,200   |       98 |
| Loc - 05 |        115 | 1,035   |       87 |
| Loc - 06 |         89 | 1,157   |       74 |
| Loc - 07 |        110 | 1,265   |       80 |
+----------+------------+---------+----------+

You're asking for a [distance metric used in clustering](https://scikit-learn.org/stable/modules/clustering.html), please read that sklearn doc. — smci, Jul 28 '19 at 20:43
You already got a [good answer to the correlation question](https://stackoverflow.com/a/57234774/202229), the rest is just "How do I do clustering in sklearn?", which is covered by sklearn doc. Please try to write your own (sklearn+pandas) code then show us where you got stuck. — smci, Jul 28 '19 at 20:49
[sklearn](https://scikit-learn.org/stable/modules/clustering.html) is one of the main machine-learning libraries for Python, please check it out and skim its documentation (classifiers, features, pipelines, etc.), sounds like you're going to be using it a lot. — smci, Jul 28 '19 at 20:54

score 0 · Answer 1 · answered Jul 28 '19 at 20:39

First, set the index of the dataframe to the Location column for easy indexing:

df1 = df1.set_index('Location')

Next, generate all combinations of Restaurants to compare:

import itertools
pairs = list(itertools.combinations(df1.index.values, 2))

Next, define a comparison function. Let's use the one from the previous post:

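import numpy as np
def compare_function(row1, row2):
    # Euclidean distance between two restaurants over the three features
    # (assumes all three columns hold numeric values)
    return np.sqrt((row1['Units Sold'] - row2['Units Sold'])**2 +
                   (row1['Revenue'] - row2['Revenue'])**2 +
                   (row1['Footfall'] - row2['Footfall'])**2)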
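One caveat with a raw Euclidean distance: Revenue is on a much larger numeric scale than Units Sold or Footfall, so it will dominate the result. A minimal sketch of one common remedy, z-scoring each column first, is shown below. The names `raw` and `scaled` are hypothetical, the data is copied from the table in the question, and standardization is an assumption about what "similar" should mean here, not something the question specifies.

```python
import pandas as pd

# Data copied from the table in the question (Revenue written as plain numbers).
raw = pd.DataFrame(
    {
        "Location": [f"Loc - {i:02d}" for i in range(1, 8)],
        "Units Sold": [100, 100, 90, 120, 115, 89, 110],
        "Revenue": [1150, 1250, 990, 1200, 1035, 1157, 1265],
        "Footfall": [85, 60, 90, 98, 87, 74, 80],
    }
).set_index("Location")

# Z-score every column: subtract the column mean, divide by the column
# standard deviation, so each feature contributes on a comparable scale.
scaled = (raw - raw.mean()) / raw.std()

print(scaled.round(2))
```

Rows of `scaled` can then be passed to the comparison function in place of rows of the unscaled dataframe.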

Next, iterate over all pairs and get results of comparison function:

results = [(row1, row2, compare_function(df1.loc[row1], df1.loc[row2]))
      for row1, row2 in pairs]

You now have a list of all pairs of restaurants and their distance from one another.
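To go from pairwise distances to actual groups, one option is hierarchical clustering. The following is a sketch rather than a definitive recipe: it assumes SciPy is available, assumes z-scored features, and picks average linkage and a maximum of 3 clusters purely for illustration. The names `df`, `scaled`, `condensed` and `clusters` are hypothetical.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Data copied from the table in the question (Revenue written as plain numbers).
df = pd.DataFrame(
    {
        "Location": [f"Loc - {i:02d}" for i in range(1, 8)],
        "Units Sold": [100, 100, 90, 120, 115, 89, 110],
        "Revenue": [1150, 1250, 990, 1200, 1035, 1157, 1265],
        "Footfall": [85, 60, 90, 98, 87, 74, 80],
    }
).set_index("Location")

# Assumption: standardize so that no single column dominates the distance.
scaled = (df - df.mean()) / df.std()

# Condensed vector of pairwise Euclidean distances; the pair order matches
# itertools.combinations over the row positions.
condensed = pdist(scaled.to_numpy(), metric="euclidean")

# Agglomerative clustering with average linkage (an illustrative choice).
tree = linkage(condensed, method="average")

# Cut the tree so that at most 3 clusters remain (also illustrative).
labels = fcluster(tree, t=3, criterion="maxclust")

clusters = pd.Series(labels, index=df.index, name="cluster")
print(clusters)
```

Because `pdist` orders pairs the same way as `itertools.combinations`, the distances collected in `results` (third element of each tuple, in order) could be passed to `linkage` in place of `condensed`.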

[`sklearn.cluster`](https://scikit-learn.org/stable/modules/clustering.html) implemented clustering years ago, no need to reinvent the wheel, and it has many excellent tutorials. — smci, Jul 28 '19 at 21:18
