I'm trying to cluster time series. Elements within a cluster have the same shape but different scales, so I would like to use a correlation measure as the clustering metric. I'm trying the correlation distance or a Pearson coefficient distance (any suggestion or alternative is welcome). However, the following code raises an error when I run Z = linkage(dist) because dist contains some NaN values. There are no NaN values in time_series; this is confirmed by

np.any(np.isnan(time_series))

which returns False

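import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage

# Condensed vector of pairwise correlation distances between the rows of time_series
dist = pdist(time_series, metric='correlation')
# This call fails when dist contains NaN
Z = linkage(dist)
fig = plt.figure()
dn = dendrogram(Z)
plt.show()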

As an alternative, I tried a Pearson distance

from scipy.stats import pearsonr

def pearson_distance(a, b):
    # pearsonr returns (coefficient, p-value); keep only the coefficient
    return 1 - pearsonr(a, b)[0]

dist = pdist(time_series, pearson_distance)

but this generates some runtime warnings and takes a lot of time.

user2614596

1 Answer

scipy.spatial.distance.pdist(time_series, metric='correlation')

If you take a look at the manual, the correlation option divides by the norms of the mean-centered series. So it could be that one of your series has the same value at every timestep: its centered norm is then zero, and dividing zero by zero gives NaN.
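
To see the effect in isolation, here is a minimal sketch on made-up data (toy_series and its values are hypothetical, not taken from the question). It assumes one series per row, which is how pdist reads its input, and checks which pairwise distances come out as NaN.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical toy data: one series per row.
toy_series = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],  # constant series: zero variance
    [2.0, 4.0, 6.0, 8.0],  # same shape as row 0, different scale
])

toy_dist = pdist(toy_series, metric="correlation")

# The condensed vector is ordered by pairs (0,1), (0,2), (1,2).
# Pairs that involve the constant row 1 are the ones to inspect.
nan_pairs = np.isnan(toy_dist)
print(toy_dist)
print(nan_pairs)
```

Every pair that includes the constant row should be flagged, while the pair of rows 0 and 2 (same shape, different scale) should have a distance close to zero.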

Dorian
  • OK, I verified that when one of the two sequences has the same value at every timestep, the coefficient is NaN. How should I handle this case? – user2614596 Nov 15 '17 at 18:20
  • This really depends on your case. You could ignore and delete these entries, which makes sense if they have no physical (or other) meaning. Or you could set them to zero, but I'm not sure what that implies for the interpretation of the correlation. It's your choice. – Dorian Nov 16 '17 at 06:36
  • I need to cluster these series, so I need a metric that tells me whether two series a and b are similar in shape, regardless of scale. – user2614596 Nov 16 '17 at 10:01
  • Then just delete these series. Having the same value at every timestep gives you a constant over time, so basically no temporal information. An `if` condition that deletes them should do it in that case. (Consider marking the question as answered in that case...) – Dorian Nov 16 '17 at 10:07
  • What if I add an epsilon to the last value of every constant series, so that this very small perturbation makes the variance nonzero? – user2614596 Nov 16 '17 at 10:22
  • You will change the correlation of your time series. That's again your choice to make. I don't have enough information about your problem to see the impact of that perturbation. – Dorian Nov 16 '17 at 11:42
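
The "delete the constant series" suggestion from the comments could be sketched roughly as follows. This is only an illustration: drop_constant_series is a hypothetical helper and toy_series is made-up data, with one series per row assumed. It uses a boolean mask rather than a literal `if`, and returns the mask so the kept rows can be mapped back to the original indices.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def drop_constant_series(series, tol=0.0):
    """Hypothetical helper: remove rows whose value range is <= tol.

    Returns the filtered array and the boolean mask of kept rows.
    """
    series = np.asarray(series, dtype=float)
    value_range = series.max(axis=1) - series.min(axis=1)
    keep = value_range > tol
    return series[keep], keep

# Hypothetical toy data: row 1 is constant.
toy_series = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],
    [2.0, 4.0, 6.0, 9.0],
    [4.0, 3.0, 2.0, 1.0],
])

filtered, keep = drop_constant_series(toy_series)
dist = pdist(filtered, metric="correlation")
Z = linkage(dist)  # the remaining distances should all be finite
```

Whether dropping is appropriate depends on the data, as discussed above; the constant series could also be collected into a cluster of their own using the inverse mask `~keep`.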