GSDMM Convergence of Clusters (Short Text Clustering)

Question

I am using this GSDMM python implementation to cluster a dataset of text messages. GSDMM converges fast (around 5 iterations) according the inital paper. I also have a convergence to a certain number of clusters, but there are still a lot of messages transferred in each iteration, so a lot of messages are still changing their cluster.

My output looks like:

In stage 0: transferred 9511 clusters with 150 clusters populated 
In stage 1: transferred 4974 clusters with 138 clusters populated 
In stage 2: transferred 2533 clusters with 90 clusters populated
….
In stage 34: transferred 1403 clusters with 47 clusters populated 
In stage 35: transferred 1410 clusters with 47 clusters populated 
In stage 36: transferred 1430 clusters with 48 clusters populated 
In stage 37: transferred 1463 clusters with 48 clusters populated 
In stage 38: transferred 1359 clusters with 48 clusters populated

In the initial paper figure 3 shows the same pattern, the number of clusters in nearly constant.

What I can't figure out is how many messages of their dataset where still transfering. My understanding is, that this number should be as small as possible, in best case zero (so every message "found" the right cluster). So the number of clusters might be converging, but that doens´t say much about the quality of the algorithm/clusters. Is my understanding correct?

It also is a possibility that my data is not good enough to get proper clustering.

Following the image you provided, the paper says that they have 500 topics and converge to 100-150 topics. So it is not zero, but like 20% of the initial value. — Ilya Palachev, Jun 10 '20 at 08:43
Yes, the cluster do converge. I my example, they also converge to 47-48 Cluster. But I´m asking about the high number of transferred messages (in the Algorithm called "clusters"). Do you know what I mean with this further explenation? — simon, Jun 10 '20 at 10:42
I think it's better to ask the author. @ryan-walker can you help us? — Ilya Palachev, Jun 11 '20 at 02:25
I have another picture. The number N in "with N cluster populated" does not change for me and it is equal to K (number of topics hyperparameter). Only the number M in "transferred M clusters" changes (starting from the number of documents (26000 in my case) and ending by the 9000-10000 value). Do you have any idea why N is changing in your case? And what are your hyperparameters (K, alpha, beta)? And does N converge in your case? — Ilya Palachev, Jun 11 '20 at 05:27
I have around 10.000 messages and start with `K=600`, converging to `N=47-48` Cluster. I think it is ok that it doesn´t corverge to a specific number, there might just be some messages that fit well in several clusters.You can also see this behaviour in the figure "TweetSet", it is moving a little bit. My Hyperparameters after some Grid Search with long Runtime are: alpha=`0.01, beta=0.05`. For K I think it is just important, that it is big enought. — simon, Jun 11 '20 at 07:58
Sorry for off-topic, just curious, what is the criteria during the grid search? Did you compute the topic coherence as a quality metric or reviewed each variant manually? — Ilya Palachev, Jun 11 '20 at 10:57
The GSDMM python implementation that I use has a build in function `mgp.score` where you can see, how sure the algorithm is in assigning an input text to a cluster. I use the average of how sure the algorithm is over all input documents to compare different Hyperparameters. It is a metrik that I came up with one my own because I had the same stuggles like you :) I´m also discussing things like that in [this](https://stackoverflow.com/questions/62108771/a-practical-example-of-gsdmm-in-python/62167039?noredirect=1#comment110240491_62167039) post — simon, Jun 13 '20 at 08:18
If you have any more insights on how you use the Algorithm or a diffrent metric, I would be very interested! — simon, Jun 13 '20 at 08:39
Actually, I don't have any specific insights, I'm just using the algorithm from box. — Ilya Palachev, Jun 17 '20 at 07:40
So what kind of metric do you use to evaluate your clusters? — simon, Jun 21 '20 at 20:22
Currently, I haven't done the accurate hyperparameters tuning and the only way I use to evaluate the clusters is to manually see them. But I think that it would be worthwhile to implement some coherence score for this model. For more popular models (like LDA) there is a widely accepted method to evaluate the model using some of the coherence metrics. Are you interested in this direction? — Ilya Palachev, Jun 22 '20 at 03:25
I am interested in coherence scores for this model and similar ones like LDA. I feel like often people use their ground truth data to evaluate their clustering, but of course if you do clustering you maybe don´t have that. For coherence scores without ground truth data I used `log-likelihood`, `u-mass` and `topic coherence` for LDA, but wasn´t really happy with the results. What do you use? — simon, Jun 22 '20 at 15:31

score 1 · Accepted Answer · answered Jun 21 '20 at 20:21

After a deeper dive into the functionality of the GSDMM Algorithm, I can share some new information.

Here is some background information about the algorithm, of course this is not a complete description of how the algorithm works:

• GSDMM is a soft clustering algorithm

• Underlaying the allocations of inputs (e.g. messages) to clusters are distributions (Multinomial distributions with Dirichlet Distributions as their Prior)

• The “Score”-Metric, that shows probability of an input belonging to a cluster, is based on a multinomial distribution and over all clusters adds up to 1

So as long as you don´t have very clear and easy separable clusters there will be inputs that “belong” to several clusters with a significant probability, e.g. Message 1 has a score value of 0.5 for Cluster 1, a score value of 0.4 for Cluster 2 and 0.1 for all the other clusters combined. If there are inputs with score values like that, due to the assignment depending on the multinomial distribution they sometimes will jump for one cluster to another one.

With knowing that I would say it is normal, even after a lot of iterations, to have jumping inputs. To measure the quality of your clustering, you should assign the inputs to the cluster with their highest score value and should not take the clustering based on the last iteration of your training.

Another option would be to leave out inputs that jump a lot or don´t have a cluster with a superior value, because these inputs are not good to fit in clusters (maybe some bad data, of course depending from case to case)

Are you going to "leave out inputs that jump a lot"? The results would be quite interesting. So if you are going to implement this idea, it would be great if you share the implementation, since the GSDMM repository you mention seems to be out of support by the author. — Ilya Palachev, Jun 22 '20 at 03:23

GSDMM Convergence of Clusters (Short Text Clustering)

1 Answers1