Recently, I have been working with Weka to cluster data into groups using the built-in EM clusterer. The clustering itself works fine, but when I save the output file, the "probabilities" of belonging to each cluster are all 0's and 1's. This made me suspicious, as it seems unlikely that Weka could distinguish between clusters with 100% confidence. So I generated data that was essentially random and "unclusterable", if you will, and upon reclustering I again found that the output probabilities were all 1's and 0's.
To be sure the clusterer wasn't picking up on some feature that I was completely overlooking, I also made a separate utility to generate a t-SNE plot of the random data. Sure enough, the plot looked random, and the clusters the EM clusterer made didn't really make sense, as should be the case for random data.
My question then is this: why is Weka's ClusterMembership filter yielding only 1's and 0's for the probability of being in a cluster, even for completely random data? Am I missing something very obvious, or is there a deeper issue?
Here is the ClusterMembership documentation and here is the closest related question I could find on SO, but it seems pretty far off from what I want. Any suggestions/ideas are welcome. I can only think of two reasons why this would be happening. Either there is something fundamentally wrong with the way my data is structured (which seems unlikely, because I have used this data in other learning problems with a high degree of success), or Weka's clustering itself is just not that good. Judging from my previous question, the latter seems plausible, although I hope this is not the case.
Update: I managed to replicate this problem with the following minimal .arff file:
@relation 'Test'
@attribute x numeric
@attribute y numeric
@data
{0 1}
{1 1}
{}
{0 1,1 1}
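For clarity, the data rows above use ARFF's sparse format, where each entry is an `index value` pair and any attribute not listed is taken to be 0. Here is a small sketch (the helper name `parse_sparse_row` is my own, not part of Weka) that expands the four rows into dense (x, y) points, which turn out to be the four corners of the unit square:

```python
def parse_sparse_row(row, num_attributes):
    """Expand one sparse ARFF data row such as '{0 1,1 1}' into a dense list."""
    values = [0.0] * num_attributes  # attributes not listed default to 0
    body = row.strip().strip("{}").strip()
    if body:
        for pair in body.split(","):
            index, value = pair.split()
            values[int(index)] = float(value)
    return values


# The four data rows from the .arff file, with two attributes (x, y).
rows = ["{0 1}", "{1 1}", "{}", "{0 1,1 1}"]
dense = [parse_sparse_row(r, 2) for r in rows]

for r, d in zip(rows, dense):
    print(r, "->", d)
```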
Running this file through the ClusterMembership filter (2 clusters), I again get probabilities that are all 1's or 0's. This clearly does not make sense: there are multiple ways to split this data into 2 groups, so a probability of 1 for any cluster is not realistic. I should add that I am using Weka 3.8.1.
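To make concrete what I was expecting, here is a minimal sketch of soft cluster memberships for the four points (1,0), (0,1), (0,0), (1,1). This is not Weka's EM implementation. It assumes two equal-weight spherical Gaussians, and the cluster centres and `sigma` values are made up purely for illustration:

```python
import math


def responsibilities(point, means, sigma):
    """Posterior P(cluster | point) for equal-weight spherical Gaussians."""
    # Log densities up to a shared constant, which cancels on normalisation.
    logs = [
        -sum((p - m) ** 2 for p, m in zip(point, mean)) / (2 * sigma ** 2)
        for mean in means
    ]
    top = max(logs)  # subtract the max before exponentiating, for stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


points = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
means = [(0.0, 0.5), (1.0, 0.5)]  # hypothetical centres: a split on x

for sigma in (1.0, 0.05):  # hypothetical standard deviations
    print("sigma =", sigma)
    for p in points:
        print(" ", p, [round(r, 3) for r in responsibilities(p, means, sigma)])
```

With `sigma = 1.0` the memberships come out soft (roughly 0.38 versus 0.62 for each point), which is the kind of output I expected. With `sigma = 0.05` they round to exactly 0 and 1. So one possibility I cannot rule out is that the fitted clusters are very narrow and the posteriors are simply saturating, rather than the filter making hard assignments.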