I have a DataFrame like the following (simplified), with the point labels as the index:
import numpy as np
import pandas as pd
a = {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan], 'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]}
df = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
df
Since it contains NaN, I wanted any column that can be parsed as numbers to be treated as numeric, so I did the following:
for col in df.select_dtypes(include=['object']):
    s = pd.to_numeric(df[col], errors='coerce')
    if s.notnull().any():
        df[col] = s
After converting the columns to a numeric type, I wanted to calculate a distance matrix as follows:
def distmetric(x,y):
    numeric5=x.select_dtypes(include=["number"])
    others5=x.select_dtypes(exclude=["number"])
    numeric6=y.select_dtypes(include=["number"])
    others6=y.select_dtypes(exclude=["number"])
    numnp5=numeric5.values
    catnp5=others5.values
    numnp6=numeric6.values
    catnp6=others6.values
    result3=np.around((np.repeat(numnp5, len(numnp6),axis=0) - np.tile(numnp6,(len(numnp5),1)))**2,3)
    catres3=~(np.equal((np.repeat(catnp5,len(catnp6),axis=0)),(np.tile(catnp6,(len(catnp5),1)))))
    sumtogeth3=result3.sum(axis=1)
    sumcattoget3=catres3.sum(axis=1)
    sum_result3=sumtogeth3+sumcattoget3
    final_result3=np.around(np.sqrt(sum_result3),3)
    final_result20=np.reshape(final_result3, (len(x.index),len(y.index)))
    return final_result20
metric=distmetric(df,df)
print(metric)
I got the following distance matrix:
[[0. 1.005 0.2 nan 1.005 1.02 1.005 nan]
[1.005 0. 1.044 nan 0.2 1.044 0.2 nan]
[0.2 1.044 0. nan 1.005 1. 1.005 nan]
[ nan nan nan nan nan nan nan nan]
[1.005 0.2 1.005 nan 0. 1.005 0. nan]
[1.02 1.044 1. nan 1.005 1. 1.005 nan]
[1.005 0.2 1.005 nan 0. 1.005 0. nan]
[ nan nan nan nan nan nan nan nan]]
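As far as I can tell, the nan entries come from how NaN behaves in NumPy arithmetic and comparisons. The small check below uses plain NumPy only and is independent of my data:

```python
import numpy as np

# NaN propagates through arithmetic, so a squared difference involving NaN is NaN.
diff = np.nan - 0.6
print(np.isnan(diff ** 2))

# NaN never compares equal to anything, including itself,
# so two missing categorical values are counted as a mismatch.
print(np.nan == np.nan)

# Summing values that include NaN also gives NaN, which is why
# every entry in the x4 and x8 rows and columns ends up as nan.
row_sum = np.array([0.04, np.nan]).sum()
print(np.isnan(row_sum))
```

This would also explain why the x6/x6 entry is 1 rather than 0: both rows have NaN in column b, and the equality test treats them as different.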
I would like to get an output like:
    x1     x2     x3     x4     x5     x6     x7     x8
x1  0.0    1.005  0.2    1.0    1.005  1.02   1.005  1.414
x2  1.005  0.0    1.044  1.414  0.2    1.044  0.2    1.414
x3  0.2    1.044  0.0    1.0    1.005  1.0    1.005  1.414
x4  1.0    1.414  1.0    0.0    1.414  1.414  1.414  1.0
x5  1.005  0.2    1.005  1.414  0.0    1.005  0.0    1.414
x6  1.02   1.044  1.0    1.414  1.005  0.0    1.005  1.0
x7  1.005  0.2    1.005  1.414  0.0    1.005  0.0    1.414
x8  1.414  1.414  1.414  1.0    1.414  1.0    1.414  0.0
I want the distance between two NaN values to be 0, and the distance between NaN and any number or any string to be 1. Is there a method or some other way of doing this?
EDIT: I am calculating the distance in the following form:
for each pair of rows (x1, x2):
    if col is numerical:
        then calculate ((x1 element)-(x2 element))**2 and store this value in squareresult
    if col is categorical:
        then compare x1 element and x2 element.
        if they are equal then cateresult=0
        else cateresult=1
    totaldistanceresultforrow=sqrt(squareresult+cateresult)
Note: NaN - NaN = 0 and NaN - (any number or string) = 1 (here '-' means subtraction).
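To make the rule above concrete, here is a minimal sketch of what I have in mind. `nan_aware_distance` is a hypothetical name I made up (it is not from pandas or NumPy). The sketch assumes both frames have the same columns with the same dtypes, and it rounds only the final result, unlike `distmetric`, which also rounds the squared differences.

```python
import numpy as np
import pandas as pd

def nan_aware_distance(x: pd.DataFrame, y: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: NaN vs NaN contributes 0, NaN vs any value contributes 1."""
    total = np.zeros((len(x), len(y)))
    for col in x.columns:
        # Missing-value masks, broadcast to shape (len(x), len(y)).
        a_na = x[col].isna().to_numpy()[:, None]
        b_na = y[col].isna().to_numpy()[None, :]
        both_na = a_na & b_na
        one_na = a_na ^ b_na
        if pd.api.types.is_numeric_dtype(x[col]):
            a = x[col].to_numpy(dtype=float)[:, None]
            b = y[col].to_numpy(dtype=float)[None, :]
            # NaN is replaced by 0 only to keep the arithmetic finite;
            # those cells are overwritten by the masks below.
            part = (np.nan_to_num(a) - np.nan_to_num(b)) ** 2
        else:
            a = x[col].to_numpy(dtype=object)[:, None]
            b = y[col].to_numpy(dtype=object)[None, :]
            part = (a != b).astype(float)  # 1 where the categories differ
        part = np.where(one_na, 1.0, part)   # NaN vs value -> 1
        part = np.where(both_na, 0.0, part)  # NaN vs NaN -> 0
        total += part
    return pd.DataFrame(np.sqrt(total).round(3), index=x.index, columns=y.index)

df = pd.DataFrame(
    {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]},
    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
result = nan_aware_distance(df, df)
print(result)
```

Handling the missing values with explicit masks, rather than through subtraction or equality, is what keeps NaN from propagating into the sums. I am not sure this is the most efficient way to do it.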