Issue in handling NaN for distance calculation?

Question

I have a DataFrame like the following (simplified), with the point names as the index:

import numpy as np
import pandas as pd
a = {'a' : [0.6,0.7,0.4,np.nan,0.5,0.4,0.5,np.nan],'b':['cat','bat','cat','cat','bat',np.nan,'bat',np.nan]}
df = pd.DataFrame(a,index=['x1','x2','x3','x4','x5','x6','x7','x8'])
df

Since the data contains NaN, I wanted columns that hold numbers to be treated as numeric, so I did the following:

for col in df.select_dtypes(include=['object']):
    s = pd.to_numeric(df[col], errors='coerce')
    if s.notnull().any():
        df[col] = s
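
For reference, here is a small sketch of what `pd.to_numeric(..., errors='coerce')` does to each kind of column. The two series below are made-up examples, not part of the data above: a purely textual column coerces to all NaN, so the loop leaves it alone, while a column of digit strings becomes float.

```python
import numpy as np
import pandas as pd

# Hypothetical sample columns, for illustration only
s_text = pd.Series(['cat', 'bat', np.nan], dtype=object)
s_digits = pd.Series(['1', '2.5', np.nan], dtype=object)

# Text cannot be parsed, so every value becomes NaN
text_converted = pd.to_numeric(s_text, errors='coerce')
text_has_numbers = bool(text_converted.notnull().any())

# Digit strings are parsed; the missing value stays NaN
digits_converted = pd.to_numeric(s_digits, errors='coerce')

print(text_has_numbers)
print(digits_converted.dtype)
```

With the DataFrame above this means column 'b' stays a non-numeric column, and only column 'a' takes part in the numeric branch of the distance.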

After converting the columns to a numeric type, I wanted to calculate a distance matrix as follows:

def distmetric(x,y):
    numeric5=x.select_dtypes(include=["number"])
    others5=x.select_dtypes(exclude=["number"])
    numeric6=y.select_dtypes(include=["number"])
    others6=y.select_dtypes(exclude=["number"])
    numnp5=numeric5.values
    catnp5=others5.values
    numnp6=numeric6.values
    catnp6=others6.values
    result3=np.around((np.repeat(numnp5, len(numnp6),axis=0) - np.tile(numnp6,(len(numnp5),1)))**2,3)
    catres3=~(np.equal((np.repeat(catnp5,len(catnp6),axis=0)),(np.tile(catnp6,(len(catnp5),1)))))
    sumtogeth3=result3.sum(axis=1)
    sumcattoget3=catres3.sum(axis=1)
    sum_result3=sumtogeth3+sumcattoget3
    final_result3=np.around(np.sqrt(sum_result3),3)
    final_result20=np.reshape(final_result3, (len(x.index),len(y.index)))
    return final_result20

metric=distmetric(df,df)
print(metric)

I got the following distance matrix:

[[0.    1.005 0.2     nan 1.005 1.02  1.005   nan]
 [1.005 0.    1.044   nan 0.2   1.044 0.2     nan]
 [0.2   1.044 0.      nan 1.005 1.    1.005   nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [1.02  1.044 1.      nan 1.005 1.    1.005   nan]
 [1.005 0.2   1.005   nan 0.    1.005 0.      nan]
 [  nan   nan   nan   nan   nan   nan   nan   nan]]
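
The NaN rows and columns are consistent with how NaN behaves in NumPy: arithmetic involving NaN yields NaN, and NaN never compares equal to anything, itself included. A minimal sketch with two illustrative values per column (the variable names are assumptions, not taken from the code above):

```python
import numpy as np

num = np.array([0.6, np.nan])
# Squared difference for every pair; any pair involving NaN is NaN
sq = (num[:, None] - num) ** 2
nan_mask = np.isnan(sq)
print(nan_mask)

cat = np.array(['cat', np.nan], dtype=object)
# NaN != NaN evaluates to True, so two missing categories count as different
neq = cat[:, None] != cat
print(neq)
```

This also explains why x6 compared with itself gives 1.0 in the matrix above: its missing category is treated as different from itself.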

I would like to get an output like:

            x1       x2       x3      x4      x5       x6       x7       x8
x1         0.0      1.005    0.2     1.0     1.005    1.02     1.005   1.414
x2         1.005    0.0     1.044   1.414    0.2      1.044    0.2     1.414
x3         0.2      1.044    0.0     1.0     1.005    1.0      1.005   1.414
x4         1.0      1.414    1.0     0.0     1.414    1.414    1.414    1.0
x5         1.005    0.2     1.005   1.414    0.0      1.005    0.0     1.414
x6         1.02     1.044    1.0    1.414    1.005    0.0      1.005    1.0
x7         1.005    0.2     1.005   1.414    0.0      1.005    0.0     1.414
x8         1.414    1.414   1.414    1.0     1.414     1.0     1.414    0.0

I want the distance between two NaN values to be 0, and the distance between NaN and any number or string to be 1. Is there a way to do this?

EDIT: I am calculating distance in the following form:

for each row:
     if col is numerical:
         then calculate ((x1 element)-(x2 element))**2 and store this value in squareresult
     if col is categorical:
         then compare x1 element and x2 element.
         if they are equal then cateresult=0
         else cateresult=1
     totaldistanceresultforrow=sqrt(squareresult+cateresult)

Note: NaN-NaN=0 and NaN-(any number or string)=1 (here '-' stands for the per-column difference)
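
The rules above can be sketched for a single pair of rows in plain Python. This is only an illustration under the stated rules; the helper names (`is_missing`, `column_term`, `row_distance`) are hypothetical, and a loop like this would be slow for large data.

```python
import math

def is_missing(v):
    # True only for float NaN (np.nan is a float); strings are never missing
    return isinstance(v, float) and math.isnan(v)

def column_term(u, v, numeric):
    """Contribution of one column to the squared distance (hypothetical helper)."""
    if is_missing(u) and is_missing(v):
        return 0.0            # NaN vs NaN
    if is_missing(u) or is_missing(v):
        return 1.0            # NaN vs any number or string
    if numeric:
        return (u - v) ** 2   # squared numeric difference
    return 0.0 if u == v else 1.0  # categorical mismatch

def row_distance(r1, r2, numeric_flags):
    total = sum(column_term(u, v, n) for u, v, n in zip(r1, r2, numeric_flags))
    return math.sqrt(total)

# Rows x1, x4 and x8 from the DataFrame above; column 'a' is numeric, 'b' is not
flags = (True, False)
x1 = (0.6, 'cat')
x4 = (float('nan'), 'cat')
x8 = (float('nan'), float('nan'))

d14 = round(row_distance(x1, x4, flags), 3)
d18 = round(row_distance(x1, x8, flags), 3)
d48 = round(row_distance(x4, x8, flags), 3)
print(d14, d18, d48)  # 1.0 1.414 1.0
```

These three values agree with the x1/x4, x1/x8 and x4/x8 entries of the desired output.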

Why don't you replace the NaN values in the DataFrame with the integer 0? I guess that would solve the problem. Refer to https://stackoverflow.com/questions/13295735/how-can-i-replace-all-the-nan-values-with-zeros-in-a-column-of-a-pandas-datafra — iamklaus, Oct 01 '18 at 09:06
@Sarthak Negi: I cannot convert the NaN values to any integer because of my algorithm. If I did, the distance metric would cause problems in my project. — Vas, Oct 01 '18 at 09:09
@Sarthak Negi: For categorical data I can do that; it won't affect my result. But I cannot do it for numerical data. — Vas, Oct 01 '18 at 09:17

score 0 · Answer 1 · answered Oct 01 '18 at 09:22

This helped me:

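a_vals = df['a'].values
# Pairwise squared differences; NaN wherever either value is NaN
square_res = (a_vals - a_vals[:, None]) ** 2
numeric = pd.DataFrame(square_res)
# Columns that are entirely NaN belong to points whose 'a' value is NaN
idx = numeric.isnull().all()
alltrueindices = np.where(idx)

# NaN vs NaN -> 0
for index in alltrueindices:
    numeric.loc[index, index] = 0
# NaN vs number -> 1
numeric = numeric.fillna(1)
# Use a placeholder so that two missing categories compare as equal
df['b'] = df['b'].replace(np.nan, '?')
b_vals = df['b'].values
cat_res = (b_vals != b_vals[:, None])
res = (numeric + cat_res) ** .5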

print(res.round(3))
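
The same idea can be generalized to any number of columns while keeping the point labels. The following is a sketch, not a tested library routine: `nan_aware_distance` is a hypothetical name, and it assumes every non-numeric column should be compared by equality.

```python
import numpy as np
import pandas as pd

def nan_aware_distance(df):
    """Pairwise distance with NaN-NaN = 0 and NaN-value = 1 (hypothetical helper)."""
    n = len(df)
    total = np.zeros((n, n))
    for col in df.columns:
        s = df[col]
        missing = s.isna().to_numpy()
        both = missing[:, None] & missing      # NaN vs NaN
        either = missing[:, None] | missing    # at least one side is NaN
        if pd.api.types.is_numeric_dtype(s):
            v = s.to_numpy(dtype=float)
            term = (v[:, None] - v) ** 2       # NaN where either side is missing
        else:
            v = s.to_numpy(dtype=object)
            term = (v[:, None] != v).astype(float)  # 1 where categories differ
        term = np.where(either, 1.0, term)     # NaN vs value -> 1
        term = np.where(both, 0.0, term)       # NaN vs NaN -> 0
        total += term
    return pd.DataFrame(np.sqrt(total), index=df.index, columns=df.index)

df = pd.DataFrame(
    {'a': [0.6, 0.7, 0.4, np.nan, 0.5, 0.4, 0.5, np.nan],
     'b': ['cat', 'bat', 'cat', 'cat', 'bat', np.nan, 'bat', np.nan]},
    index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'])
dist = nan_aware_distance(df)
print(dist.round(3))
```

Building the masks with `isna()` avoids both the placeholder string and the label-based loop, and the result is indexed by x1 to x8 like the desired output in the question.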
