Finding the most correlated item

Question

I have a restaurant sales details as below.

+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+----------+
| Loc - 01 |        100 | 1,150   |       85 |
+----------+------------+---------+----------+

I want to find the most correlated restaurant to the above from the below tables restaurant data

+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+----------+
| Loc - 02 |        100 | 1,250   |       60 |
| Loc - 03 |         90 | 990     |       90 |
| Loc - 04 |        120 | 1,200   |       98 |
| Loc - 05 |        115 | 1,035   |       87 |
| Loc - 06 |         89 | 1,157   |       74 |
| Loc - 07 |        110 | 1,265   |       80 |
+----------+------------+---------+----------+

please guide me how this can be done with python or pandas.. Note : - correlation means most matching/similar restaurant in terms of Units Sold, Revenue & Footfall.

I'm new to pandas. that's why I requested to guide me through the process to solve this. — Tommy, Jul 27 '19 at 17:28
At least read your data as a pandas dataframe, and calculate the correlation matrix. That should be a good start. — FatihAkici, Jul 27 '19 at 17:34
are you familiar with Numpy? I worked on a similar problem recently, I can break this down for you with a numpy/pandas approach. — Umar.H, Jul 27 '19 at 17:37
@Datanovice yes. I can manage a bit of numpy. please guide me — Tommy, Jul 27 '19 at 17:43
Pretty sure I've got the answer, but can you post your expected outcome ? — Umar.H, Jul 27 '19 at 17:48

jezrael · Accepted Answer · 2019-07-27T18:06:42.950

4

If your correlation should be described like minimal euclidean distance, solution is:

#convert columns to numeric
df1['Revenue'] = df1['Revenue'].str.replace(',','').astype(int)
df2['Revenue'] = df2['Revenue'].str.replace(',','').astype(int)

#distance of all columns subtracted by first row of first DataFrame
dist = np.sqrt((df2['Units Sold']-df1.loc[0, 'Units Sold'])**2 + 
               (df2['Revenue']- df1.loc[0, 'Revenue'])**2 + 
               (df2['Footfall']- df1.loc[0, 'Footfall'])**2)

print (dist)
0    103.077641
1    160.390149
2     55.398556
3    115.991379
4     17.058722
5    115.542200
dtype: float64

#get index of minimal value and select row of second df
print (df2.loc[[dist.idxmin()]])
   Location  Units Sold  Revenue  Footfall
4  Loc - 06          89     1157        74

edited Jul 27 '19 at 18:06

answered Jul 27 '19 at 18:02

jezrael

822,522
95
1,334
1,252

2

@Datanovice - thank you, I hope this is what OP need ;) – jezrael Jul 27 '19 at 18:04
@Datanovice @piRSquared I got my answer. thank you all. But, I have a follow up question. Here i wanted to select all the correlated restaurants based on `Loc - 01`.What if I want to select all the similarly correlated restaurants, without a base restaurants like `Loc - 01`? a kind of a clustering based on the correlation to each other? – Tommy Jul 28 '19 at 06:56
@Tommy - I think the best is create new question. – jezrael Jul 28 '19 at 06:58

score 2 · Answer 2 · answered Jul 27 '19 at 17:53

Might be a better way to do this, but I think this works, it's quite verbose so I've tried to keep the code clean and readable:

First, lets use a self defined numpy function from this post.

import numpy as np
import pandas as pd


def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]

then using the arrays of your dataframe, pass in the value from your first dataframe to find the closest match.

us = find_nearest(df2['Units Sold'],df['Units Sold'][0])
ff = find_nearest(df2['Footfall'],df['Footfall'][0])
rev = find_nearest(df2['Revenue'],df['Revenue'][0])

print(us,ff,rev,sep=',')
100,87,1157

then return a data frame with all three conditions

    new_ df = (df2.loc[
    (df2['Units Sold'] == us) |
    (df2['Footfall'] == ff) |
    (df2['Revenue'] == rev)])

which gives us :

    Location    Units Sold  Revenue Footfall
0   Loc - 02    100         1250    60
3   Loc - 05    115         1035    87
4   Loc - 06    89          1157    74

score 2 · Answer 3 · answered Jul 27 '19 at 19:50

Fix Data

For numeric columns. I generalized this too much probably. Also, I set the index to be the 'Location' column

def fix(d):
    d.update(
        d.astype(str).replace(',', '', regex=True)
         .apply(pd.to_numeric, errors='ignore')
    )
    d.set_index('Location', inplace=True)

fix(df1)
fix(df2)

Manhattan Distance

df2.loc[[df2.sub(df1.loc['Loc - 01']).abs().sum(1).idxmin()]]

          Units Sold Revenue  Footfall
Location                              
Loc - 06          89    1157        74

Euclidean Distance

df2.loc[[df2.sub(df1.loc['Loc - 01']).pow(2).sum(1).pow(.5).idxmin()]]

          Units Sold Revenue  Footfall
Location                              
Loc - 06          89    1157        74

Finding the most correlated item

3 Answers3

Fix Data

Manhattan Distance

Euclidean Distance

Linked