TL;DR: how do you normalize stream data when the whole data set is not available up front and you are doing clustering in an evolving environment?
Hi! I'm currently studying dynamic clustering for non-stationary data streams. I need to normalize the data so that all features have the same impact on the final clustering, but I don't know how to do it ...
I need to apply standard (z-score) normalization. My initial approach was to:
- Fill a buffer with the initial data points
- Use those points to compute the mean and standard deviation of each feature
- Use those statistics to normalize the buffered points
- Send the normalized points to the algorithm one by one
- Keep using the same statistics to normalize incoming points for a while
- Every so often, recompute the mean and standard deviation from recent points
- Re-express the current micro-cluster centroids with the new statistics (since I still have the old ones, undoing the old normalization and applying the new one shouldn't be a problem)
- Use the new statistics to keep normalizing incoming points for a while
- And so on ... (see the sketch after this list)
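To make those steps concrete, here is a minimal sketch of what I have in mind in Python. The class and parameter names (`PeriodicZScoreNormalizer`, `warmup_size`, `refresh_every`) are my own, not from CluStream/DenStream or any particular library:

```python
import numpy as np

class PeriodicZScoreNormalizer:
    """Buffered z-score normalization with periodic re-estimation of the
    per-feature mean and standard deviation (hypothetical sketch)."""

    def __init__(self, warmup_size=500, refresh_every=2000):
        self.warmup_size = warmup_size      # points buffered before the first stats
        self.refresh_every = refresh_every  # recompute stats after this many points
        self.window = []                    # recent raw points used to refresh stats
        self.mean = None
        self.std = None
        self.count = 0

    def _refresh_stats(self):
        data = np.asarray(self.window, dtype=float)
        self.mean = data.mean(axis=0)
        std = data.std(axis=0)
        self.std = np.where(std > 0, std, 1.0)  # guard against constant features

    def process(self, point):
        """Return a list of normalized points ready for the clusterer:
        empty while warming up, usually of length one afterwards."""
        point = np.asarray(point, dtype=float)
        self.window.append(point)
        self.window = self.window[-self.warmup_size:]  # keep only recent points
        self.count += 1

        if self.mean is None:
            if len(self.window) < self.warmup_size:
                return []                   # still filling the initial buffer
            self._refresh_stats()
            return [(p - self.mean) / self.std for p in self.window]

        if self.count % self.refresh_every == 0:
            self._refresh_stats()           # periodic re-estimation on the recent window

        return [(point - self.mean) / self.std]
```

Each call to `process` returns zero or more normalized points that can then be fed to the clustering algorithm one by one.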
The thing is, normalization shouldn't have to interfere with what the clustering algorithm does. You can't just tell the clustering algorithm 'the micro-clusters you have so far need to be re-normalized with this new mean and stdev'. I developed my own algorithm where I could do this, but I'm also using existing algorithms (CluStream and DenStream), and it doesn't feel right to modify them for this ...
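For what it's worth, re-expressing a centroid under new statistics is just an affine map, so if an implementation exposes its centroids it could in principle be done from the outside. A hedged sketch, where `centroids` and the argument names are hypothetical per-feature arrays:

```python
import numpy as np

def renormalize_centroids(centroids, old_mean, old_std, new_mean, new_std):
    """Re-express centroids computed under (old_mean, old_std) z-scoring
    in terms of (new_mean, new_std) z-scoring."""
    centroids = np.asarray(centroids, dtype=float)
    raw = centroids * old_std + old_mean   # undo the old normalization
    return (raw - new_mean) / new_std      # apply the new normalization
```

This only covers the centroids, though; micro-cluster summaries that also keep sums of squares (as CluStream's CF vectors do, if I understand them correctly) would need the same kind of treatment for their radius statistics, which is exactly the internal surgery I'd rather avoid.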
Any ideas?
TIA