I am using `scipy` to compare different distance functions using data contained in pandas dataframes. For reference, I am checking the distance between different parts my company manufactures.
(This is obviously a toy example for this question. Sorry if anything is incomplete; I am trying to make a Minimal, Reproducible Example.)
I have the test dataframe, `x`, which looks like this:
| part_number | make_buy_M | make_buy_B | alternate_Y | alternate_N | value |
|:-----------:|:----------:|:----------:|:-----------:|:-----------:|:-----:|
| A | 1 | 0 | 0 | 1 | 1065 |
I then have a large dataframe, `data`, which looks exactly the same but contains many parts:
| part_number | make_buy_M | make_buy_B | alternate_Y | alternate_N | value |
|:-----------:|:----------:|:----------:|:-----------:|:-----------:|:-----:|
| B | 1 | 0 | 0 | 1 | 982 |
| C | 0 | 1 | 0 | 1 | 87 |
| D | 1 | 0 | 0 | 1 | 2342 |
| E | 0 | 1 | 1 | 0 | 56233 |
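
For reproducibility, the two tables above could be built directly in code along these lines. This is only a sketch: the values are copied from the tables, and the column name `part_number` follows the table headers (the snippets further down use `PART_NO` instead).

```python
import pandas as pd

columns = ["part_number", "make_buy_M", "make_buy_B",
           "alternate_Y", "alternate_N", "value"]

# The single test part (one row).
x = pd.DataFrame([["A", 1, 0, 0, 1, 1065]], columns=columns)

# The parts to compare against (many rows in the real data).
data = pd.DataFrame(
    [
        ["B", 1, 0, 0, 1, 982],
        ["C", 0, 1, 0, 1, 87],
        ["D", 1, 0, 0, 1, 2342],
        ["E", 0, 1, 1, 0, 56233],
    ],
    columns=columns,
)

print(x.shape, data.shape)
```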
I have a function that loops through `scipy` distance metrics. What I would like to do is compare `x` to each row of the dataframe and store those results in a `dict`:

    import pandas as pd, numpy as np, scipy, gc as gc
    from math import *
    from decimal import Decimal
    from scipy import spatial

    # Resources:
    # - https://dataconomy.com/2015/04/implementing-the-five-most-popular-similarity-measures-in-python/

    # Resources
    def euclidean_distance(x, y):
        return sqrt(sum(pow(a-b,2) for a, b in zip(x, y)))

    def cosine_similarity(x,y):
        def square_rooted(x):
            return round(sqrt(sum([a*a for a in x])),3)
        numerator = sum(a*b for a,b in zip(x,y))
        denominator = square_rooted(x)*square_rooted(y)
        return round(numerator/float(denominator),3)

    # Read in CSV
    x = pd.read_csv('Test_Part_Directory')
    y = pd.read_csv('Other_Parts_Directory')

    metrics = ['cosine', 'euclidean']
    euclidean_dict = {}
    cosine_dict = {}

    # How to loop through the y for this?
    # for x in y.rows():
    #     current_row = y[x]
    #     Then do the below codes
    for m in metrics:
        try:
            curr = scipy.spatial.distance.cdist(x.iloc[:,:], y.iloc[:,:], metric=m)
            print("Metric: {} | Score: {} ".format(m, curr))
            """
            Currently commented out
            if m == 'cosine':
                cosine_dict[part_number from dict] = curr
            else:
                euclidean_dict[part_number from dict] = curr
            """
        except:
            print("Error calculating {}".format(m))

Ultimately, I am looking for two dicts that contain key-value pairs of `part_number: metric_score`, so something like:

    euclidean_dict = {'B': 0.954, 'C': 0.233, 'D': 0.003, 'E': 0.012}

I have written the code above, which gets to the current point, but have not worked out how to loop over the rows and store each score.
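
One possible way to produce dicts of this shape without an explicit row loop is to hand every part to `cdist` in a single call and zip the resulting distances with the part numbers. The following is a sketch under assumptions, not a definitive implementation: it uses the toy values and the `part_number` column name from the tables, and it assumes every remaining column is numeric. Note that scipy's `'cosine'` metric is the cosine distance (1 minus the cosine similarity), and that the unscaled `value` column dominates both metrics, so the numbers will not resemble the illustrative scores above.

```python
import pandas as pd
from scipy.spatial.distance import cdist

columns = ["part_number", "make_buy_M", "make_buy_B",
           "alternate_Y", "alternate_N", "value"]
x = pd.DataFrame([["A", 1, 0, 0, 1, 1065]], columns=columns)
data = pd.DataFrame(
    [
        ["B", 1, 0, 0, 1, 982],
        ["C", 0, 1, 0, 1, 87],
        ["D", 1, 0, 0, 1, 2342],
        ["E", 0, 1, 1, 0, 56233],
    ],
    columns=columns,
)

# part_number is a label, not a feature, so it is dropped before measuring.
x_features = x.drop(columns="part_number").to_numpy(dtype=float)
data_features = data.drop(columns="part_number").to_numpy(dtype=float)

results = {}
for metric in ["cosine", "euclidean"]:
    # cdist returns an array of shape (1, n_parts); row 0 holds the
    # distance from the single test part to every part in data.
    distances = cdist(x_features, data_features, metric=metric)[0]
    results[metric] = dict(zip(data["part_number"], distances))

cosine_dict = results["cosine"]
euclidean_dict = results["euclidean"]
print(euclidean_dict)
```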
I have looked at this question, but it tells me not to loop.
UPDATE - I did try the following:

    for index, row in data.iterrows():
        part_number = data['PART_NO'].iloc[0]
        y = row.drop('PART_NO', axis=1)
        for m in metrics:
            try:
                curr = scipy.spatial.distance.cdist(x.iloc[:,:], y.iloc[:,:], metric=m)
                print("Part Number: {} | Metric: {} | Score: {} ".format(part_number, m, curr))
            except:
                print("Error calculating {}".format(m))

But received:

    Traceback (most recent call last):
      File "distance_function.py", line 95, in <module>
        y = row.drop('PART_NO', axis=1)
      File "C:\Python367-64\lib\site-packages\pandas\core\series.py", line 4139, in drop
        errors=errors,
      File "C:\Python367-64\lib\site-packages\pandas\core\generic.py", line 3923, in drop
        axis_name = self._get_axis_name(axis)
      File "C:\Python367-64\lib\site-packages\pandas\core\generic.py", line 420, in _get_axis_name
        raise ValueError(f"No axis named {axis} for object type {cls}")
    ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>

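
The traceback points at the cause: `iterrows()` yields each row as a `Series`, and a `Series` has only axis 0, so `axis=1` is rejected. A small sketch of dropping the label from the row itself follows; the two-row frame is made up for illustration and only the `PART_NO` name comes from the snippet above. It also reads the part number from the current row, since `data['PART_NO'].iloc[0]` always returns the first part.

```python
import pandas as pd

# Hypothetical stand-in for the real data.
data = pd.DataFrame(
    {"PART_NO": ["B", "C"], "make_buy_M": [1, 0], "value": [982, 87]}
)

for index, row in data.iterrows():
    # row is a Series whose index holds the column names.
    part_number = row["PART_NO"]
    # Series.drop works on index labels, so no axis argument is needed.
    features = row.drop("PART_NO")
    print(part_number, features.to_dict())
```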
UPDATE 2 - I did try the following:

    part_number = data['PART_NO'].iloc[0]
    temp = row.to_frame()
    y = temp.drop('PART_NO', axis=1)

Yet I receive the same error.
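
A likely reason this variant also fails is that `Series.to_frame()` returns a single-column frame whose index holds the original labels, so `PART_NO` ends up as a row label rather than a column. Transposing gives the one-row layout that `drop(..., axis=1)` and `cdist` expect. The sketch below uses a made-up row to show the difference in shape.

```python
import pandas as pd

# Hypothetical row, shaped like what iterrows() would yield.
row = pd.Series({"PART_NO": "B", "make_buy_M": 1, "value": 982})

as_column = row.to_frame()    # shape (3, 1): labels become the index
as_row = row.to_frame().T     # shape (1, 3): labels become the columns

# With the labels as columns, dropping along axis=1 works; the cast turns
# the object-typed values back into floats for distance calculations.
y = as_row.drop("PART_NO", axis=1).astype(float)
print(as_column.shape, as_row.shape, list(y.columns))
```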