TL;DR: how do you normalize stream data, given that the whole data set is never available and you are doing clustering in an evolving environment?

Hi! I'm currently studying dynamic clustering for non-stationary data streams. I need to normalize the data because all features should have the same impact on the final clustering, but I don't know how to do it.

I need to apply a standard (z-score) normalization. My initial approach, sketched in code after the list, was to:

  1. Fill a buffer with initial data points
  2. Use those data points to get mean and standard deviation
  3. Use those measures to normalize the current data points
  4. Send those normalized points to the algorithm one by one
  5. Use the previous measures to keep normalizing incoming data points for a while
  6. Every so often, recalculate the mean and standard deviation
  7. Re-represent the current micro-cluster centroids with the new measures (since the older measures are still available, it shouldn't be a problem to undo the old normalization and apply the new one)
  8. Use the new measures to keep normalizing incoming data points for a while
  9. And so on ....
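
Roughly, in Python, I mean something like the following (class and parameter names are just illustrative, not taken from any library):

```python
import numpy as np

class PeriodicNormalizer:
    """Steps 1-9 above: buffer, fit mean/std, normalize, periodically re-fit."""

    def __init__(self, buffer_size=500, refresh_every=2000):
        self.buffer_size = buffer_size      # step 1: initial buffer length
        self.refresh_every = refresh_every  # step 6: how often to re-estimate
        self.mean = None                    # current normalization measures
        self.std = None
        self._buffer = []
        self._since_refresh = 0

    def process(self, x):
        """Feed one raw point; returns a list of normalized points
        (empty while the initial buffer is still filling)."""
        x = np.asarray(x, dtype=float)
        if self.mean is None:                        # steps 1-2: fill and fit
            self._buffer.append(x)
            if len(self._buffer) < self.buffer_size:
                return []
            batch = np.vstack(self._buffer)
            self._refit(batch)
            self._buffer.clear()
            return [self._normalize(p) for p in batch]  # steps 3-4: flush buffer
        self._buffer.append(x)                       # keep recent points for step 6
        self._buffer = self._buffer[-self.buffer_size:]
        self._since_refresh += 1
        if self._since_refresh >= self.refresh_every:   # step 6: re-estimate
            self._refit(np.vstack(self._buffer))
            self._since_refresh = 0
        return [self._normalize(x)]                  # steps 5 and 8

    def _normalize(self, p):
        return (p - self.mean) / self.std

    def _refit(self, window):
        self.mean = window.mean(axis=0)
        self.std = window.std(axis=0) + 1e-12        # guard against zero variance
```

Step 7 is the part that doesn't fit in here: the micro-clusters live inside the clustering algorithm, not in this class.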

The thing is that normalizing the data should not interfere with what the clustering algorithm does. You can't tell the clustering algorithm "OK, the micro-clusters you have so far need to be re-normalized with this new mean and stdev". I developed an algorithm of my own and I could add this to it, but I am also using existing algorithms (CluStream and DenStream), and it doesn't feel right to modify them just to be able to do this.
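
To be explicit, the re-mapping in step 7 is simple algebra; the problem is only *where* to apply it (a hypothetical helper):

```python
def remap(z_old, mean_old, std_old, mean_new, std_new):
    """Re-express a value normalized with the old measures in the new space:
    undo the old normalization, then apply the new one."""
    x = z_old * std_old + mean_old      # back to raw coordinates
    return (x - mean_new) / std_new     # normalize with the new measures
```

Applying this to every stored micro-cluster centroid means reaching into the algorithm's internal state, which is exactly what I'd rather avoid.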

Any ideas?

TIA

onofricamila
  • You fit on a part of your data, then you transform. As you receive the remaining observations, you just "transform" them: no more fitting, and no more mean and std calculations. But you do need some amount of data first to "fit" on. In this case, you don't interfere with what the clustering algorithm is doing. – Yahya Apr 01 '20 at 14:42
  • Hi Yahya! Thanks for commenting. The thing is that I am working with concept drift: the data is known to evolve. Because of this, I have to keep measuring the statistics over different windows to track the evolution. I think there is no alternative and I will have to do what is stated in the question. If the data set is known to be stationary, your approach is more than applicable. Thanks again! – onofricamila Apr 01 '20 at 15:18
  • Then the approach you are following is absolutely correct. You need to re-fit the clusters every time you receive x new observations. All the best. – Yahya Apr 01 '20 at 15:21
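
For the stationary case described in the first comment, the fit-once-then-transform pattern is, for example, with scikit-learn's `StandardScaler` (stand-in data, just to make the sketch runnable):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
initial_batch = rng.normal(size=(500, 3))   # stand-in for the first points
later_stream = rng.normal(size=(100, 3))    # stand-in for later arrivals

scaler = StandardScaler().fit(initial_batch)       # fit once ...
for x in later_stream:
    z = scaler.transform(x.reshape(1, -1))[0]      # ... then only transform
    # feed z to the clustering algorithm; no further mean/std updates
```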

2 Answers


As more data streams in, the estimated standardization parameters (e.g., mean and standard deviation) are updated and converge toward the true values [1, 2, 3]. In evolving environments this effect is even more pronounced, since the data distributions are themselves time-varying [4]. Therefore, the more recent samples, standardized using the more recent parameter estimates, are the more accurate and representative ones.
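
For example, such running estimates can be maintained incrementally per feature; here is a sketch using Welford's online algorithm (my own illustration, not taken from [1, 2, 3]):

```python
import numpy as np

class RunningStats:
    """Welford's online algorithm: per-feature mean/variance, one sample at a time."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self._m2 = np.zeros(dim)        # running sum of squared deviations

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n     # incremental mean update
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return np.sqrt(self._m2 / max(self.n - 1, 1))
```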

A solution is to merge the present with a partial reflection of the past by embedding a new decay parameter in the update rule of your clustering algorithm. It boosts the contribution of the more recent samples, which have been standardized using the more recent distribution estimates. You can see an implementation of this idea in Apache Spark's MLlib [5, 6, 7]:

$$c_{t+1} = \frac{c_t\, n_t\, \alpha + x_t\, m_t}{n_t\, \alpha + m_t}, \qquad n_{t+1} = n_t + m_t$$

where $c_t$ is the cluster's previous center, $n_t$ is the number of points assigned to the cluster so far, $x_t$ is the center of the points added to the cluster in the current batch, $m_t$ is the number of those points, and $\alpha$ is the new decay parameter; a lower $\alpha$ makes the algorithm favor the more recent samples more.
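
A minimal NumPy sketch of that update rule (my own illustration, not the actual MLlib code; in PySpark the corresponding knob is `StreamingKMeans().setDecayFactor(...)`):

```python
import numpy as np

def decayed_update(c_t, n_t, batch_points, alpha):
    """One centroid update following the rule above.
    c_t: current centroid (d,); n_t: weight of points seen so far;
    batch_points: points assigned to this cluster in the current batch."""
    batch = np.asarray(batch_points, dtype=float)
    m_t = len(batch)
    x_t = batch.mean(axis=0)                       # center of the new batch
    c_next = (c_t * n_t * alpha + x_t * m_t) / (n_t * alpha + m_t)
    return c_next, n_t + m_t                       # n_{t+1} = n_t + m_t
```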

Reveille
  • Hi! Thanks for answering. I like the idea of giving more weight to new samples. However, it seems there is no way around modifying the existing stream clustering implementations (like the ones provided by MOA) to change the representation of the micro-clusters when evolution is noticed. Suppose you don't do that, and you have a micro-cluster whose centroid is 0 (= old avg). Then you send the algorithm a new sample normalized with the new statistics. If sample = new avg, the sample's value is 0, so the algorithm will 'store it' in that previous micro-cluster when it shouldn't, because the two averages differ. – onofricamila Apr 05 '20 at 01:49
  • What is MOA? I'm not familiar with it. – Reveille Apr 05 '20 at 01:57
  • https://github.com/Waikato/moa It's a framework for big data stream mining, which happens to include implementations of many well-known stream clustering algorithms. Pretty cool - I had some trouble using the graphical interface though – onofricamila Apr 05 '20 at 02:00
  • Oh, OK. Thanks! Which clustering algorithm do you use, specifically? There may be a way to apply Apache Spark MLlib's idea for k-means to it. We should also keep in mind that an underlying assumption here is that the process is evolutionary (and not "revolutionary"). If the statistics change radically, our past normalizations and learnings are not representative at all, and they need to be reset from scratch for all the data. – Reveille Apr 05 '20 at 02:13
  • I'm using CluStream and DenStream. It felt weird to modify them, and it's strange that dynamic stream normalization is not taken into account in the implementations. I will take a look at the papers you cited – onofricamila Apr 05 '20 at 02:24

Data normalization changes the distances between points, so it affects any clustering algorithm that relies on the L2 (Euclidean) distance. Therefore you can't really have a global, algorithm-agnostic solution to your question.

If your clustering algorithm supports it, one option would be to use clustering with a warm start (see the sketch after the list), in the following way:

  • at each step, find the "evolved" clusters from scratch, using the samples re-normalized according to the new mean and std dev
  • do not initialize the clusters randomly; instead, use the clusters found in the previous step, re-represented in the new space.
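
A sketch of one such step with scikit-learn's batch `KMeans` (illustrative only; the function and parameter names are mine, and a streaming algorithm would need its own re-initialization hook):

```python
import numpy as np
from sklearn.cluster import KMeans

def refit_step(raw_window, prev_centers, prev_stats, k=5):
    """Re-normalize the current window with fresh statistics and warm-start
    k-means from the previous centers mapped into the new space."""
    mean_new = raw_window.mean(axis=0)
    std_new = raw_window.std(axis=0) + 1e-12
    Z = (raw_window - mean_new) / std_new              # re-normalized samples

    if prev_centers is None:                           # very first step
        km = KMeans(n_clusters=k, n_init=10).fit(Z)
    else:
        mean_old, std_old = prev_stats
        # previous centers, re-expressed in the new normalized space
        init = (prev_centers * std_old + mean_old - mean_new) / std_new
        km = KMeans(n_clusters=k, init=init, n_init=1).fit(Z)
    return km.cluster_centers_, (mean_new, std_new)
```
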
serxio
  • Hi! Thanks for answering. I didn't understand the following: "Data normalization affects clustering for algorithms that depend on the L2 distance". I mean, normalization is used to give the same importance to all the features (as all end up on the same 'scale'). That said, it shouldn't matter which distance metric you use when clustering. Please correct me if I'm mistaken. Then, when it comes to the clustering approach, the learning is online, so it is initialized only once. Re-normalizing the new samples would be inconsistent with the currently normalized micro-clusters. – onofricamila Apr 05 '20 at 01:27
  • I think there's no option but to follow the approach mentioned in my question (I wanted to delete the question but couldn't because of the bounty). The thing is that I didn't like the idea of messing with the implementations of the algorithms I use, but you have to do so in order to re-represent the micro-clusters with the new statistics. I just found it weird that the MOA implementations don't consider 'dynamic' stream normalization. – onofricamila Apr 05 '20 at 01:31
  • If the MOA implementations you are referring to use the L2 distance, then, instead of normalizing the samples, you could adjust the code to use the Mahalanobis distance and dynamically update the covariance matrix. This lets you leave the coordinates of the samples unchanged. But note that this does not solve the conceptual problem: you may end up with clusters that would not have been created had the samples arrived in a different order. – serxio Apr 06 '20 at 08:11
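
A minimal illustration of the Mahalanobis idea (stand-in data; with a diagonal covariance it reduces to per-feature standardization):

```python
import numpy as np

def mahalanobis(x, center, cov_inv):
    """Distance of x to a cluster center under the inverse covariance."""
    d = np.asarray(x, dtype=float) - center
    return float(np.sqrt(d @ cov_inv @ d))

# Estimate the covariance from a recent window and refresh it as the stream
# drifts; the raw sample coordinates never change, only the metric does.
window = np.random.default_rng(1).normal(size=(500, 3))  # stand-in recent points
cov = np.cov(window, rowvar=False) + 1e-9 * np.eye(3)    # regularized
cov_inv = np.linalg.inv(cov)
print(mahalanobis(window[0], window.mean(axis=0), cov_inv))
```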