
I have a pivot table from which I want to calculate the pairwise distance matrix between days. My dataset contains NaN values, so when I use sklearn's pairwise_distances it raises an error.

Is there any way to overcome this?

The pivot table X looks like this:

 time   04:45:00    05:00:00   05:15:00
 date     
 01-01    61           NaN        44
 01-02    23            70         NaN

from sklearn.metrics.pairwise import pairwise_distances
pairwise_distances(X)

I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Venkatachalam

1 Answer


A simple workaround is to fill the NaN values with imputed values before computing the distances. The fill could be 0, the mean, a forward or backward fill along the row, etc.
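As a rough sketch of those options side by side, assuming a small hand-built frame that mirrors the two rows from the question (the variable names here are only illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the two rows shown in the question.
X = pd.DataFrame(
    [[61.0, np.nan, 44.0], [23.0, 70.0, np.nan]],
    index=["01-01", "01-02"],
    columns=["04:45:00", "05:00:00", "05:15:00"],
)

# Constant fill: every NaN becomes 0.
zero_filled = X.fillna(0)

# Mean fill per day: each row's NaN becomes that row's own mean.
row_mean_filled = X.apply(lambda row: row.fillna(row.mean()), axis=1)

# Mean fill per time slot: each NaN becomes its column's mean.
col_mean_filled = X.fillna(X.mean())

# Forward fill along the row, then backward fill to cover a NaN
# that sits in the first column and has nothing to its left.
ffilled = X.ffill(axis=1).bfill(axis=1)
```

Which fill is appropriate depends on the data: a constant 0 can inflate distances if real values are far from 0, while forward fill assumes neighbouring time slots are similar.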

Try something like this!

d = """ time   04:45:00    05:00:00   05:15:00
 date     
 01-01    61           NaN        44
 01-02    23            70        NaN
"""


import pandas as pd
from io import StringIO

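# Whitespace-delimited parse; the lone "date" line becomes row 0, padded with NaN
df = pd.read_csv(StringIO(d), sep=r'\s+')
# Skip that row and the label column, then forward-fill each row left to right
# (.ffill replaces the deprecated fillna(method='ffill'))
data = df.iloc[1:, 1:].ffill(axis=1)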
data
    04:45:00    05:00:00    05:15:00
1   61.0    61.0    44.0
2   23.0    70.0    70.0

Now apply pairwise_distances to the filled data:

from sklearn.metrics.pairwise import pairwise_distances
pairwise_distances(data)

# array([[ 0.        , 46.91481642],
#       [46.91481642,  0.        ]])
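
If imputing values is undesirable, a different technique from the fill above is a NaN-aware metric. The sketch below assumes scikit-learn 0.22 or later, where nan_euclidean_distances is available; it skips coordinates that are missing in either row and rescales the result by the share of coordinates actually used. The array X here is a hand-built stand-in for the question's data.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Stand-in for the question's pivot table, NaN values left in place.
X = np.array([
    [61.0, np.nan, 44.0],
    [23.0, 70.0, np.nan],
])

# Only coordinates present in both rows contribute; the squared distance
# is scaled by (total coordinates / coordinates present in both rows).
D = nan_euclidean_distances(X)
```

In this example only the first column is observed in both rows, so the distance rests on a single coordinate scaled by 3/1. With so little overlap the result is fragile, and a pair of rows with no shared coordinates yields NaN.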

Venkatachalam