I have a data frame where the rows represent objects and columns are object features.

I am trying to compute the cosine similarity of the objects. The code runs without errors, but when I sort the distances, the closest objects all have a distance of 0. That would only be possible if their vectors were the same, which is not the case.

I looked into the output, and it seems that any value smaller than about 1e-16 just becomes 0 (it shows as 0 both in the terminal printout and in the CSV output file).

The columns are of dtype float64.

How can I show greater precision?

For reference here is the code I am running:

import pandas as pd
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

dfe = pd.read_csv('file.csv')

dfe = dfe.set_index('object')

dfe = dfe.fillna(dfe.mean())

pairwise = pd.DataFrame(squareform(pdist(dfe, metric='cosine')),columns = dfe.index,index = dfe.index)

long_form = pairwise.unstack()

long_form.index.rename(['object_1', 'object_2'], inplace=True)
long_form = long_form.to_frame('distance').reset_index()
Mustard Tiger

1 Answer

If you mean that the difference between two elements comes out as 0 whenever they differ by less than about 1e-16, that is the float64 precision limit: float64 carries roughly 16 significant decimal digits, so print(1 + 1e-16) prints 1.0. The details are available from numpy.finfo(numpy.float64) (the old numpy.float alias has been removed from recent NumPy releases).
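To see this limit concretely, here is a minimal sketch using only NumPy. The variable names are illustrative and not part of the question's code.

```python
import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable float64
# (about 2.22e-16).
eps = np.finfo(np.float64).eps
print(eps)

# An increment well below eps is absorbed when added to 1.0.
absorbed = (1.0 + 1e-16 == 1.0)
print(absorbed)

# Very small values are still representable on their own; the limit is
# relative, and only bites when they are combined with values near 1.
tiny = 1e-300
print(tiny > 0.0)
```

The last line matters here: float64 can hold numbers far smaller than 1e-16, so a result only collapses to 0 when it comes from subtracting two values that agree in their first 16 or so digits.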

You should try using higher precision dtypes. For example:

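import numpy
dfe = pd.read_csv('file.csv').set_index('object').astype(numpy.float128)  # set the index first so only feature columns are cast; float128 is not available on every platform (e.g. Windows)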

If the result of squareform is still of dtype float64, try updating your SciPy library to a later version.
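One way to check whether extra precision changes anything, independently of the dtype that pdist returns, is to compute the cosine distance for a single pair of rows by hand in extended precision. The following is only a sketch under assumptions: cosine_distance_longdouble is a hypothetical helper, not a SciPy function, and numpy.longdouble is platform dependent (on some platforms it is the same as float64, in which case nothing is gained).

```python
import numpy as np

def cosine_distance_longdouble(u, v):
    """Hypothetical helper: 1 - cos(u, v), computed in numpy.longdouble."""
    u = np.asarray(u, dtype=np.longdouble)
    v = np.asarray(v, dtype=np.longdouble)
    # Cosine similarity is the dot product divided by the product of norms.
    similarity = np.dot(u, v) / (np.sqrt(np.dot(u, u)) * np.sqrt(np.dot(v, v)))
    return np.longdouble(1) - similarity

# Orthogonal vectors have distance 1; parallel vectors have distance ~0.
print(cosine_distance_longdouble([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance_longdouble([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))

# Shows how many decimal digits longdouble carries on this platform.
print(np.finfo(np.longdouble).precision)
```

For two rows of the question's data frame this could be called as cosine_distance_longdouble(dfe.iloc[0], dfe.iloc[1]) and compared with the corresponding entry of pairwise.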

Dimitry