
I have a large set of binary data that I need to cluster. For example

[[0 1 1 0 ... 0 1 0 1 ],
 [1 0 1 1 ... 0 0 1 1 ],
 ...
 [0 0 1 0 ... 1 0 1 1 ]]

From what I've read, the best clustering algorithms for binary data are hierarchical ones, such as agglomerative clustering, so I implemented that using scikit-learn.

I have a very large data set, with new data coming in all the time, and I would like to assign each new item to one of the previously found clusters. My thinking was to take a random sample of the existing data, run AgglomerativeClustering on it, and save the results to a file using joblib.

Then, when a new set of data arrives, I would load the previously saved clustering and call predict() to figure out where the new data would fall. It's almost like I'm training a clustering the way I would train a classifier, but without the labels. The problem is that AgglomerativeClustering doesn't have a predict() method. Other clustering algorithms in scikit-learn, such as KMeans, do have predict(), but based on my research KMeans is not a good algorithm to use when dealing with binary data.

So I'm stuck. I don't want to have to run the clustering every single time new data arrives, because hierarchical algorithms do not scale well with a lot of data, but I'm not sure which algorithm would work with binary data and also provide predict() functionality.

Is there a way I can transform the binary data so that other algorithms, like KMeans, can provide useful outputs? Or is there a completely different algorithm not implemented in scikit that would work? I'm not tied to scikit so switching is not an issue.

JQPx
    Personally I think your question will be easier to read (and answer) if you try to abbreviate it – Itay May 04 '19 at 16:31

1 Answer


When you want to predict, use a classifier, not clustering.

Here, the most appropriate classifier would likely be a 1NN (one-nearest-neighbor) classifier. For performance reasons, though, I'd choose a decision tree (DT) or an SVM instead.
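As a rough illustration of the idea, here is a minimal sketch that clusters a sample once and then trains a 1NN classifier on the resulting cluster labels. The synthetic data, the choice of 3 clusters, the default Ward linkage, and the Hamming metric for the neighbor search are all assumptions made for the example, not something taken from the question.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical binary data: 3 random prototypes, copied with about 5% of bits flipped.
prototypes = rng.integers(0, 2, size=(3, 64))


def noisy_copies(n):
    rows = prototypes[rng.integers(0, 3, size=n)]
    flips = rng.random(rows.shape) < 0.05
    return np.where(flips, 1 - rows, rows)


sample = noisy_copies(300)    # stands in for a random sample of existing data
new_rows = noisy_copies(10)   # stands in for newly arriving data

# Expensive step, done once: cluster the sample (n_clusters=3 is an assumption).
labels = AgglomerativeClustering(n_clusters=3).fit_predict(sample)

# Train a 1NN classifier to reproduce the cluster labels.
# Hamming distance is one reasonable choice for 0/1 vectors.
clf = KNeighborsClassifier(n_neighbors=1, metric="hamming", algorithm="brute")
clf.fit(sample, labels)

# New rows get the cluster label of their nearest clustered neighbor.
new_labels = clf.predict(new_rows)
print(new_labels)
```

Swapping `KNeighborsClassifier` for a decision tree or an SVM only changes the classifier line; the cluster labels are used as training targets in the same way.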

Has QUIT--Anony-Mousse
  • Predict may have been the wrong word (although, as I mentioned, various clustering algorithms have a predict() method). I don't have clean labeled data to train on, so I can't predict with a classifier at this point. – JQPx May 06 '19 at 02:54
  • It is rather an exception that a clustering algorithm has a `predict`, and you don't necessarily get the same result as when fitting to the data. That is rather an artifact of sklearn's API design, which originally wanted everything to have `fit` and `predict` (which causes the odd case that you have `fit(X, y)` for clustering algorithms, where `y` is ignored). What you want is a classifier trained to predict the cluster label. – Has QUIT--Anony-Mousse May 06 '19 at 05:58
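
To connect the last comment with the joblib workflow described in the question, the sketch below saves a classifier trained on cluster labels and reloads it later to label new rows without re-clustering. The random data, `n_clusters=4`, the decision tree, and the file name `cluster_classifier.joblib` are placeholders assumed for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for a random sample of the existing binary data.
sample = rng.integers(0, 2, size=(200, 32))

# One-off step: cluster the sample (n_clusters=4 is an arbitrary choice).
cluster_labels = AgglomerativeClustering(n_clusters=4).fit_predict(sample)

# Train a classifier to reproduce the cluster labels, then save it to disk.
clf = DecisionTreeClassifier(random_state=0).fit(sample, cluster_labels)
model_path = os.path.join(tempfile.mkdtemp(), "cluster_classifier.joblib")
joblib.dump(clf, model_path)

# Later, when new rows arrive: load the classifier and label them.
new_rows = rng.integers(0, 2, size=(5, 32))
loaded = joblib.load(model_path)
new_labels = loaded.predict(new_rows)
print(new_labels)
```

Only the classifier is persisted here; the clustering itself would be rerun on a fresh sample only when the clusters need to be refreshed.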