I am using this GSDMM Python implementation to cluster a dataset of text messages. According to the original paper, GSDMM converges fast (in around 5 iterations). In my case the number of clusters also converges, but a lot of messages are still transferred in each iteration, so many messages keep changing their cluster.
My output looks like:
In stage 0: transferred 9511 clusters with 150 clusters populated
In stage 1: transferred 4974 clusters with 138 clusters populated
In stage 2: transferred 2533 clusters with 90 clusters populated
….
In stage 34: transferred 1403 clusters with 47 clusters populated
In stage 35: transferred 1410 clusters with 47 clusters populated
In stage 36: transferred 1430 clusters with 48 clusters populated
In stage 37: transferred 1463 clusters with 48 clusters populated
In stage 38: transferred 1359 clusters with 48 clusters populated
Figure 3 in the original paper shows the same pattern: the number of clusters is nearly constant.
What I can't figure out is how many messages in their dataset were still being transferred. My understanding is that this number should be as small as possible, ideally zero (so that every message has "found" the right cluster). So the number of clusters might be converging, but that doesn't say much about the quality of the algorithm or the clusters. Is my understanding correct?
It is also possible that my data is not good enough to get a proper clustering.
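To make the pattern in my output easier to compare across runs, here is a minimal sketch of how I summarize the log. It assumes the log lines have exactly the format shown above. The names `parse_log` and `tail_summary` are my own helpers, not part of the GSDMM implementation, and the sample only contains the last five stages from my output.

```python
import re

# Matches log lines in the format shown in my output above (assumed format).
LOG_LINE = re.compile(
    r"In stage (\d+): transferred (\d+) clusters with (\d+) clusters populated"
)


def parse_log(text):
    """Return a list of (stage, transferred, populated) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in LOG_LINE.finditer(text)]


def tail_summary(records, window=5):
    """Summarize the last `window` stages of the run."""
    tail = records[-window:]
    transfers = [transferred for _, transferred, _ in tail]
    populated = [clusters for _, _, clusters in tail]
    return {
        "mean_transferred": sum(transfers) / len(transfers),
        "min_populated": min(populated),
        "max_populated": max(populated),
    }


# Last five stages copied from my output.
log = """\
In stage 34: transferred 1403 clusters with 47 clusters populated
In stage 35: transferred 1410 clusters with 47 clusters populated
In stage 36: transferred 1430 clusters with 48 clusters populated
In stage 37: transferred 1463 clusters with 48 clusters populated
In stage 38: transferred 1359 clusters with 48 clusters populated
"""

records = parse_log(log)
summary = tail_summary(records)
print(summary)
```

For these five stages the number of populated clusters stays between 47 and 48, while on average about 1413 messages are still transferred per stage, which is the mismatch I am asking about.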