1

I want to calculate pairwise distances (like here). The columns in the data frame are of type string and float. Any idea or trick how to avoid the 'ValueError: could not convert string to float: 'abc-def''.

For testing purpose I have used the 'cosine' metric. Does any other metric work with float and strings? Or is possible to implement a pairwise distance calculation on your own that can handle floats and string?

Thanks

Alexander L. Hayes
  • 3,892
  • 4
  • 13
  • 34
TRK
  • 85
  • 6
  • You might try combining cosine and [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance), or [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) if strings are unequal lengths. – evces Dec 19 '22 at 14:42
  • Thanks, good suggestion. How can I combine two metrics in the implementation? – TRK Dec 19 '22 at 15:09
  • See my complete answer for one possible solution. – evces Dec 19 '22 at 16:30

2 Answers2

0

Cosine distance is a good choice if whatever your float variables represent can be considered as directions in a vector space (i.e., magnitude does not matter).

Levenshtein distance (string edit distance) may be a good choice for your string columns (although of course this does not capture anything about semantics). However, pairwise Levenshtein distance doesn't appear to be available from any of the usual packages (also note this answer did not work for me), but you can obtain them with a few lines of code as shown below.

Combining different distance metrics can be done in many ways depending on your data and what you are trying to do. In the code below I use a simple summing method to combine a cosine distance matrix and multiple Levenshtein distance matrices. Notice that I normalize the Levenshtein distances by the maximum edit distances so they are in [0, 1]. This ensures they are weighted similarly to cosine distances in the final result. However, it's not perfect because cosine distances are in [0, 2]. Also, for multiple string columns, I am computing the pairwise Levenshtein distance on each column separately.

import numpy as np
import pandas as pd
from leven import levenshtein
from itertools import combinations
from sklearn.metrics import pairwise_distances

df = pd.DataFrame(
    {"float1": [1, 2, 0, 2], "float2": [0,1,1,1], "string1": ["cat", "cattle", "dog", "drat"], "string2":["a","aa","aba","abc"]}
)
# Pairwise levenshtein distances (normalized by max distance)
N = len(df)
out = np.zeros((N, N))
inds = np.triu_indices(N,k=1)
string_cols = df.columns[df.iloc[0].apply(lambda x: isinstance(x,str))].values
for col in string_cols:
    lev_dists = [levenshtein(i,j) for (i,j) in combinations(df[col],2)]
    lev_dists /= np.max(lev_dists)
    out[inds] += lev_dists
# Pairwise cosine distances for vector columns (0-1 by definition)
cos_dists = pairwise_distances(df[['float1','float2']],metric='cosine')
# Fill in the output matrix
out[inds] += cos_dists[inds]
out.round(2)

Returns:

array([[0.  , 1.11, 2.5 , 1.44],
       [0.  , 0.  , 2.05, 1.83],
       [0.  , 0.  , 0.  , 1.55],
       [0.  , 0.  , 0.  , 0.  ]])
evces
  • 372
  • 8
  • Thanks, that's a great start. What if I have more than 1 column with strings? – TRK Dec 19 '22 at 17:14
  • @TRK See my edit - you could do pairwise Levenshtein distance within each column, and simply sum the results. But keep in mind this simple summing is just one solution, it might not be ideal depending on what you want to do with this distance matrix, the nature of your string data, etc. – evces Dec 19 '22 at 20:50
  • I am not sure how to do this. Can you give me a hint? It seems to be not working when I use 'pairwise_distances' from scikit or 'pdist' from scipy when using a self implemented or pre-implemented levenshtein distance method. I get either the same casting error or it is not working row-wise (but column-wise for pdist). Thank you for your help. – TRK Dec 20 '22 at 13:24
  • I think my latest edited code does what you need. I had the same experience with those pairwise distance methods from the packages, they are not good for Levenshtein. – evces Dec 20 '22 at 13:27
  • It does only for one string column but not for multiple columns. Or am I missing something? And the other workarounds ('pairwise_distances' from scikit or 'pdist' from scipy) are not working somhow. – TRK Dec 20 '22 at 15:10
  • Notice I added a second string column to the example data `'string2'`. The code automatically detects all string columns by checking which contain a string in their first row. Then there is a loop over these columns that computes pairwise levenshtein distance within each column, and cumulatively sums them in `out`. After the loop, the cosine distances for the float columns are added on. – evces Dec 20 '22 at 15:13
  • Oh, thanks. Did not see this. That will work! – TRK Dec 20 '22 at 15:41
  • @TRK your question inspired me to write a long answer to another question, you might be interested: https://stackoverflow.com/a/74870785/20521021 — and please accept this answer if you think it is good! Thanks – evces Dec 21 '22 at 03:33
0

Code:

from sklearn.neighbors import DistanceMetric
from math import radians
import pandas as pd
import numpy as np

cities_df = pd.DataFrame({'city':['bangalore','Mumbai','Delhi','kolkatta','chennai','bhopal'],
    'lat':[12.9716,19.076,28.7041,22.5726,13.0827,23.2599],
                          'lon':[77.5946,72.877,77.1025,88.639,80.2707,77.4126],})

cities_df['lat'] = np.radians(cities_df['lat'])
cities_df['lon'] = np.radians(cities_df['lon'])


dist = DistanceMetric.get_metric('haversine')


cities_df[['lat','lon']].to_numpy()


dist.pairwise(cities_df[['lat','lon']].to_numpy())*6373


pd.DataFrame(dist.pairwise(cities_df[['lat','lon']].to_numpy())*6373,  columns=cities_df.city.unique(), index=cities_df.city.unique())

Result:

enter image description here

halfer
  • 19,824
  • 17
  • 99
  • 186
ASH
  • 20,759
  • 19
  • 87
  • 200
  • Does this also work with the following data frame? ```cities_df = pd.DataFrame({'city':['bangalore','Mumbai','Delhi','kolkatta','chennai','bhopal'],'city2':['bangalore2','Mumbai2','Delhi2','kolkatta2','chennai2','bhopa2l'], 'float' : [1.0,2.0,1.0,2.0,3.0,1.5]})``` – TRK Dec 21 '22 at 12:49