5

I am trying to cluster time series data in Python using different clustering techniques. K-means didn't give good results. The images below show what I get with agglomerative clustering. I also tried dynamic time warping (DTW). The two seem to give similar results.

What I would ideally like is two separate clusters for the time series in the second image. The first image is a cluster of rapid increases, the second is a cluster with no real increase (roughly stable), and the third is a cluster of decreasing trends. I would like to know which time series are both stable and popular (by popular I mean a high count). I tried hierarchical clustering, but the results showed far too many levels and I am not sure how to pick the level at which to cut the hierarchy. Can someone shed light on how to split the time series in the second image into two clusters, one with low counts and the other with high counts? Is that possible, or should I just visually pick a threshold to cut them into two?

Cluster with rapid increases:

[image]

Cluster with stable counts:

[image]

Cluster with decreasing trends:

[image]

This is very vague, but this is the result of my hierarchical clustering:

[image]

I know this particular image is not useful at all, but this is a dead end for me as well.

In general, if you wanted to differentiate trends, say for YouTube videos, how do only some get picked for the "trending" section and others for the "trending this week" section? As I understand it, the "trending" videos are the ones that show characteristics similar to the first image. The "trending this week" section is a collection of videos that have very high view counts but are quite stable (i.e. not showing rapid increases). I know that YouTube considers many other factors in addition to view counts. What I am trying to do with the second image is similar to the "trending this week" section: I would like to pick the series that have very high counts. How do I split the time series in this case?

I know DTW captures trends. DTW gave the same results as the images above. It has identified the trend in the second image, which is "stable", but it doesn't capture the "count" element. I want both the trend and the count to be captured, in this case stable and high count.

The above images are time series clustered based on counts. Am I missing out on any other clustering techniques that could achieve this? Even with just counts, how do I cluster differently according to my needs?

Any ideas would be much appreciated. Thanks in advance!

Gingerbread
  • 2
    It's not about missing any clustering techniques. If you feed K-means (or any other algo) with the raw data, then the results won't be good. You need to construct features out of the time series (like average day-over-day increase, number of times the next observation is above the previous one and so on). Regarding the high counts, I think you should define a threshold yourself. No algorithm will do this for you. – Stergios Aug 10 '17 at 13:56
  • Can you edit your question saying what clustering technique you tried with DTW as distance and what all distance metrics you tried for K-Means clustering apart from Euclidean? – nth-attempt Aug 10 '17 at 19:02
  • K-Means with Euclidean distance does not by itself make use of the time ordering. To see this, shuffle the time points of every series in the same way: you should get the same clusters, since Euclidean distance ignores the order of the coordinates. @Stergios You are essentially trying to build time-based features to feed to K-Means. Do you know any other clustering methods where you can cluster raw time series directly? One thing I know is to use DTW as the distance with hierarchical clustering. – nth-attempt Aug 10 '17 at 19:09
  • @ultramarine I'm not aware of any algorithm that would take raw time-series and cluster them. – Stergios Aug 11 '17 at 05:24
  • Improve your preprocessing and feature extraction! – Has QUIT--Anony-Mousse Aug 11 '17 at 06:53
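
The features suggested in the first comment above can be sketched in a few lines. This is only an illustration using the standard library; the function name `step_features` and the sample series are made up for the example, and the "level" feature is an assumed way to expose the count that a manual threshold could then be applied to.

```python
from statistics import mean

def step_features(series):
    """Hand-built features for one time series (a list of counts)."""
    # Differences between consecutive observations.
    steps = [b - a for a, b in zip(series, series[1:])]
    return {
        # Average day-over-day increase (negative for a decreasing series).
        "mean_step": mean(steps),
        # Share of steps where the next observation is above the previous one.
        "up_fraction": sum(s > 0 for s in steps) / len(steps),
        # Overall level, which a "high count" threshold could be applied to.
        "level": mean(series),
    }

# Hypothetical sample data, not taken from the question.
rising = [10, 14, 19, 27, 40]
stable_high = [500, 498, 503, 501, 499]

print(step_features(rising))
print(step_features(stable_high))
```

Feature vectors like these, rather than the raw series, would then be what gets passed to K-means or any other clustering algorithm.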

2 Answers

0

The best thing you can do is to extract some features from your time series. The first feature to extract in your case is the trend (see linear trend estimation).
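
As a rough sketch of that idea, the snippet below estimates a least-squares slope per series, keeps the series whose relative slope is small ("stable"), and then splits those by mean level at the widest gap. The sample data, the tolerance `REL_SLOPE_TOL`, and the largest-gap rule are all assumptions chosen for illustration; the gap rule is just one simple alternative to picking the threshold by eye.

```python
from statistics import mean

def linear_slope(series):
    """Least-squares slope of the series against time steps 0, 1, 2, ..."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = mean(series)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def split_by_largest_gap(levels):
    """Return the midpoint of the widest gap between sorted values."""
    ordered = sorted(levels)
    gaps = [(b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])]
    return max(gaps)[1]

# Hypothetical sample data, not taken from the question.
series = {
    "a": [20, 21, 19, 20, 20],
    "b": [22, 22, 23, 21, 22],
    "c": [510, 505, 512, 508, 510],
    "d": [480, 483, 479, 481, 482],
    "e": [5, 40, 90, 160, 250],
}

# Slope divided by mean level, so high and low series are judged on the same scale.
REL_SLOPE_TOL = 0.05  # assumed tolerance for calling a series "stable"
stable = {k: v for k, v in series.items()
          if abs(linear_slope(v)) / mean(v) < REL_SLOPE_TOL}

threshold = split_by_largest_gap([mean(v) for v in stable.values()])
labels = {k: ("stable-high" if mean(v) > threshold else "stable-low")
          for k, v in stable.items()}
print(threshold, labels)
```

The gap rule needs at least two stable series and assumes the two groups really are well separated in level; if the levels form a continuum, no automatic cut will be meaningful.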

Another thing you can do is to cluster the cumulative version of your time series, as suggested and explained in this other post: Time series distance metrics
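
The linked post is not reproduced here, so the following is only one reading of why the cumulative series can help: in the cumulative series, a difference in level turns into a difference in slope, while point-to-point jitter matters less. The three sample series are made up for the example.

```python
from itertools import accumulate
from math import dist  # Euclidean distance, Python 3.8+

def cumulative(series):
    """Running total of the series."""
    return list(accumulate(series))

# Hypothetical data: two low-count series that differ only in phase,
# and one steady high-count series.
low_a = [10, 0, 10, 0]
low_b = [0, 10, 0, 10]
high = [20, 20, 20, 20]

# How much farther is the high series than the other low series?
raw_ratio = dist(low_a, high) / dist(low_a, low_b)
cum_ratio = (dist(cumulative(low_a), cumulative(high))
             / dist(cumulative(low_a), cumulative(low_b)))

print(round(raw_ratio, 2), round(cum_ratio, 2))
```

In this toy case the cumulative version makes the high-count series stand out more clearly from the two low-count ones than the raw series does.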

paolof89
0

You can use DTW to cluster trends by computing the total minimum distance; see my answer here to a similar question. I had a very similar problem and ended up publishing my own Python package for this purpose. Check this for details. You can also see a demo here.
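
For readers who have not seen DTW written out, here is a minimal textbook dynamic-programming version. It is not the implementation from the package mentioned above, only a sketch of the standard recurrence with absolute difference as the local cost; the sample series are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+

def dtw_distance(a, b):
    """Classic DTW: minimum total cost of aligning a with b, with warping."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# The same ramp, shifted by one step in time.
early = [0, 1, 2, 2]
late = [0, 0, 1, 2]

print(dtw_distance(early, late))  # warping absorbs the shift
print(dist(early, late))          # Euclidean distance does not
```

A matrix of pairwise `dtw_distance` values can then be fed to a hierarchical clustering routine that accepts precomputed distances. Note that this runs in O(n·m) time per pair, so it gets slow for many long series.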

Dogan Askan