3

Part of my dataset (in reality my dataset size (106,1800)):

df =

    1           1.1     2           2.1     3           3.1     4           4.1     5           5.1
0   43.1024     6.7498  NaN         NaN     NaN         NaN     NaN         NaN     NaN         NaN
1   46.0595     1.6829  25.0695     3.7463  NaN         NaN     NaN         NaN     NaN         NaN
2   25.0695     5.5454  44.9727     8.6660  41.9726     2.6666  84.9566     3.8484  44.9566     1.8484
3   35.0281     7.7525  45.0322     3.7465  14.0369     3.7463  NaN         NaN     NaN         NaN
4   35.0292     7.5616  45.0292     4.5616  23.0292     3.5616  45.0292     6.7463  NaN         NaN

What I am able to do now based on Tom's answer:

  • I manually wrote 1-st 2 rows like p and q value:

p =

[[45.1024,7.7498],[45.1027,7.7513],[45.1072,7.7568],[45.1076,7.7563]]

q=

[[45.0595,7.6829],[45.0595,7.6829],[45.0564,7.6820],[45.0533,7.6796],[45.0501,7.6775]]

THEN:

__all__ = ['frdist']


def _c(ca, i, j, p, q):

    if ca[i, j] > -1:
        return ca[i, j]
    elif i == 0 and j == 0:
        ca[i, j] = np.linalg.norm(p[i]-q[j])
    elif i > 0 and j == 0:
        ca[i, j] = max(_c(ca, i-1, 0, p, q), np.linalg.norm(p[i]-q[j]))
    elif i == 0 and j > 0:
        ca[i, j] = max(_c(ca, 0, j-1, p, q), np.linalg.norm(p[i]-q[j]))
    elif i > 0 and j > 0:
        ca[i, j] = max(
            min(
                _c(ca, i-1, j, p, q),
                _c(ca, i-1, j-1, p, q),
                _c(ca, i, j-1, p, q)
            ),
            np.linalg.norm(p[i]-q[j])
            )
    else:
        ca[i, j] = float('inf')

    return ca[i, j]

THEN:

def frdist(p, q):

    # Remove nan values from p
    p = np.array([i for i in p if np.any(np.isfinite(i))], np.float64)
    q = np.array([i for i in q if np.any(np.isfinite(i))], np.float64)

    len_p = len(p)
    len_q = len(q)

    if len_p == 0 or len_q == 0:
        raise ValueError('Input curves are empty.')

    # p and q will no longer be the same length
    if len(p[0]) != len(q[0]):
        raise ValueError('Input curves do not have the same dimensions.')

    ca = (np.ones((len_p, len_q), dtype=np.float64) * -1)

    dist = _c(ca, len_p-1, len_q-1, p, q)
    return(dist)

frdist(p, q)

It works. But how I could apply p and q to the whole dataset? Not by choosing row by row?

Finally I need to get 106 to 106 symmetric matrix with 0 diagonal

kmaork
  • 5,722
  • 2
  • 23
  • 40
Mamed
  • 1,102
  • 8
  • 23
  • You may remove NaN values from `p` and also remove corresponding values from `q`. See, for example, https://stackoverflow.com/questions/11620914/removing-nan-values-from-an-array –  Aug 26 '19 at 11:40
  • @Poolka Impossible, Because min number of value is 1 and the max is 1500 – Mamed Aug 26 '19 at 11:58
  • I don't understand your because part. How does it prevent to simply drop all NaNs from `p`? Suppose you have 100 values and 2 NaNs among them -> drop NaNs -> you have 98 values and you can perform calculations. –  Aug 26 '19 at 12:46
  • @Poolka Sorry. My fault. It is not real dataset. In real dataset p is 1 values and q has 1800 values – Mamed Aug 26 '19 at 12:49
  • @anky_91 `p` - first row and `q` - second row – Mamed Aug 28 '19 at 12:18
  • okay, so we drop the second row columns where the first row columns are NaN, else there will be a length mismatch – anky Aug 28 '19 at 12:42
  • @anky_91Yes. We need to remove them. Tom presented the method, but I am not able to apply it to my dataset – Mamed Aug 28 '19 at 12:46
  • @Mamed What problems do you get when applying my method to your dataset? – Ted Aug 28 '19 at 12:56
  • @Tom Give me 5 min, I'll edit my question based on your answer and show the issue. ok? – Mamed Aug 28 '19 at 12:57
  • @Tom I have edited the question – Mamed Aug 28 '19 at 13:29
  • 1
    It looks like you delete most of the question as I can only see 2 lines without any code – Eric Villemure Aug 29 '19 at 19:16
  • @EricVillemure My code was wrong from the beginning. Now trying to fix and edit the question again – Mamed Aug 29 '19 at 21:32

2 Answers2

3

Remove NaN values

Simple and straightforward:

p = p[~np.isnan(p)]


Calculate Fréchet distances for whole dataset

The easiest way is to use pairwise distances calculation pdist from SciPy. It takes an m observations by n dimensions array, so we need to reshape our row arrays using reshape(-1,2) inside frdist. pdist returns the condensed (upper triangular) distance matrix. We use squareform to get the m x m symmetric matrix with 0 diagonal as requested.

import pandas as pd
import numpy as np
import io
from scipy.spatial.distance import pdist, squareform

data = """    1           1.1     2           2.1     3           3.1     4           4.1     5           5.1
0   43.1024     6.7498  NaN         NaN     NaN         NaN     NaN         NaN     NaN         NaN
1   46.0595     1.6829  25.0695     3.7463  NaN         NaN     NaN         NaN     NaN         NaN
2   25.0695     5.5454  44.9727     8.6660  41.9726     2.6666  84.9566     3.8484  44.9566     1.8484
3   35.0281     7.7525  45.0322     3.7465  14.0369     3.7463  NaN         NaN     NaN         NaN
4   35.0292     7.5616  45.0292     4.5616  23.0292     3.5616  45.0292     6.7463  NaN         NaN
"""
df = pd.read_csv(io.StringIO(data), sep='\s+')

def _c(ca, i, j, p, q):

    if ca[i, j] > -1:
        return ca[i, j]
    elif i == 0 and j == 0:
        ca[i, j] = np.linalg.norm(p[i]-q[j])
    elif i > 0 and j == 0:
        ca[i, j] = max(_c(ca, i-1, 0, p, q), np.linalg.norm(p[i]-q[j]))
    elif i == 0 and j > 0:
        ca[i, j] = max(_c(ca, 0, j-1, p, q), np.linalg.norm(p[i]-q[j]))
    elif i > 0 and j > 0:
        ca[i, j] = max(
            min(
                _c(ca, i-1, j, p, q),
                _c(ca, i-1, j-1, p, q),
                _c(ca, i, j-1, p, q)
            ),
            np.linalg.norm(p[i]-q[j])
            )
    else:
        ca[i, j] = float('inf')

    return ca[i, j]

def frdist(p, q):

    # Remove nan values and reshape into two column array
    p = p[~np.isnan(p)].reshape(-1,2)
    q = q[~np.isnan(q)].reshape(-1,2)

    len_p = len(p)
    len_q = len(q)

    if len_p == 0 or len_q == 0:
        raise ValueError('Input curves are empty.')

    # p and q will no longer be the same length
    if len(p[0]) != len(q[0]):
        raise ValueError('Input curves do not have the same dimensions.')

    ca = (np.ones((len_p, len_q), dtype=np.float64) * -1)

    dist = _c(ca, len_p-1, len_q-1, p, q)
    return(dist)

print(squareform(pdist(df.values, frdist)))

Result:

[[ 0.         18.28131545 41.95464432 29.22027212 20.32481187]
 [18.28131545  0.         38.9573328  12.59094238 20.18389517]
 [41.95464432 38.9573328   0.         39.92453004 39.93376923]
 [29.22027212 12.59094238 39.92453004  0.         31.13715882]
 [20.32481187 20.18389517 39.93376923 31.13715882  0.        ]]


No need to re-invent the wheel

Fréchet distance calculation is already provided by similaritymeasures. So the following will give you the same result as above:

from scipy.spatial.distance import pdist, squareform
import similaritymeasures

def frechet(p, q):
    p = p[~np.isnan(p)].reshape(-1,2)
    q = q[~np.isnan(q)].reshape(-1,2)
    return similaritymeasures.frechet_dist(p,q)

print(squareform(pdist(df.values, frechet))) 
Stef
  • 28,728
  • 2
  • 24
  • 52
  • Hello, one moment: ` NameError: name 'squareform' is not defined` – Mamed Sep 01 '19 at 19:08
  • My fault. Entered but forgot to run. Thanks! P.S. Your code and my code are working for small data. With my real data it gives me `RecursionError: maximum recursion depth exceeded in comparison`. I think I will create a new question, but maybe you can advice me something to avoid that? – Mamed Sep 01 '19 at 19:13
  • Does the `RecusrionError` also occur with `similaritymeasures.frechet_dist`? – Stef Sep 01 '19 at 19:26
  • Yes. I can split my data, but looking for a better solution – Mamed Sep 01 '19 at 19:27
  • The reason of `RecusrionError` is the part with `if` `elif` btw – Mamed Sep 01 '19 at 19:29
  • 1
    you may try and increase the [`recursionlimit`](https://docs.python.org/3/library/sys.html#sys.setrecursionlimit), e.g. `sys.setrecursionlimit(1500)` – Stef Sep 01 '19 at 19:42
  • see also https://stackoverflow.com/questions/3323001/what-is-the-maximum-recursion-depth-in-python-and-how-to-increase-it – Stef Sep 01 '19 at 19:59
0

I think the only change you'd have to make would be inside the frdist function, to first remove the nan values from p. This would then necessitate removing the condition that p and q are the same length, but I think that should be OK as you say yourself that p has 1 value and q has 1800 values.

def frdist(p, q):

    # Remove nan values from p
    p = np.array([i for i in p if np.any(np.isfinite(i))], np.float64)
    q = np.array(q, np.float64)

    len_p = len(p)
    len_q = len(q)

    if len_p == 0 or len_q == 0:
        raise ValueError('Input curves are empty.')

    # p and q no longer have to be the same length
    if len(p[0]) != len(q[0]):
        raise ValueError('Input curves do not have the same dimensions.')

    ca = (np.ones((len_p, len_q), dtype=np.float64) * -1)

    dist = _c(ca, len_p-1, len_q-1, p, q)
    return(dist)

Then gives:

frdist(p, q)
1.9087938076177846
Ted
  • 1,189
  • 8
  • 15