
I have a bunch of sentences and I want to cluster them using scikit-learn spectral clustering. I've run the code and get the results with no problem, but every time I run it I get different results. I know this is a problem with initialization, but I don't know how to fix it. This is the part of my code that runs on the sentences:

vectorizer = TfidfVectorizer(norm='l2', sublinear_tf=True, tokenizer=tokenize,
                             stop_words='english', charset_error="ignore",
                             ngram_range=(1, 5), min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k, eigen_solver='arpack',
                                      affinity="nearest_neighbors",
                                      assign_labels="discretize")
spectral.fit(X)

Data is a list of sentences. Every time the code runs, my clustering results differ. How can I get consistent results using spectral clustering? I also have the same problem with KMeans. This is my code for KMeans:

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english', charset_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100, n_init=1, verbose=0)
km.fit(X_data)

I appreciate your help.

user3430235

4 Answers


When using k-means, you want to set the random_state parameter in KMeans (see the documentation). Set this to either an int or a RandomState instance.

km = KMeans(n_clusters=number_of_k, init='k-means++', 
            max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)

This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start at different points. Seeding the pseudo-random number generator ensures that this randomness will always be the same for identical seeds.
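As a quick sanity check (a sketch on toy data from `make_blobs`, standing in for the TF-IDF matrix), two fits with the same seed produce identical labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data standing in for the TF-IDF matrix
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

a = KMeans(n_clusters=3, n_init=1, random_state=3425).fit_predict(X)
b = KMeans(n_clusters=3, n_init=1, random_state=3425).fit_predict(X)
# with the seed fixed, the two runs agree label-for-label
print((a == b).all())
```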

I'm not sure about the spectral clustering example though. From the documentation on the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." OP's code doesn't seem to fall into those cases, though setting the parameter might be worth a shot.
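For what it's worth, SpectralClustering does accept random_state, so seeding it is cheap to try. A minimal sketch on synthetic data (not OP's TF-IDF matrix), keeping OP's affinity and label-assignment settings:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

params = dict(n_clusters=3, affinity="nearest_neighbors",
              assign_labels="discretize", random_state=42)
# two runs with the same seed should agree
labels_a = SpectralClustering(**params).fit_predict(X)
labels_b = SpectralClustering(**params).fit_predict(X)
print((labels_a == labels_b).all())
```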

Roger Fan
  • Thanks for the hint on KMeans. Does the random state setting really affect the results? For example, if I set random_state=2222, will it change the results much? I'll try it and see. Regarding spectral clustering, I checked the documentation prior to posting this question, but there isn't much about initialization. There is a random_state parameter though, which I will set like the one in KMeans and see how it changes things. Thanks again. – user3430235 Sep 18 '14 at 21:56
  • @user3430235 I think it all depends on your data. I haven't used it extensively, but I get the impression that k-means is actually fairly sensitive to the starting value. Of course, that's part of why k-means++ was developed, to get more consistently good starting values, but it's still probably an issue worth considering. Another common strategy is to run it multiple times with different seeds and pick the best one. – Roger Fan Sep 18 '14 at 21:59
  • By default the implementation actually runs k-means 10 times and uses the best resulting clustering. So yes, it does affect the output in all but the trivial cases. – Andreas Mueller Sep 21 '14 at 16:56
  • @AndreasMueller if I use 10 `n_init` and specify the `random_state`, as `n_init=10, random_state=3425` , does this make sense? `n_init` is the number of time the k-means algorithm will be run with different centroid seeds. Will the centroids change or not due to the fixed `random_state` ?? – seralouk Aug 05 '19 at 12:47
  • The random state is set at the beginning, not for each initialization, for the obvious reasons... – Has QUIT--Anony-Mousse Aug 05 '19 at 23:35
  • I had similar problems when running the code in different computers. You may want to set up the random generator instance in the random_state to avoid that for def `random_state=np.random.RandomState(12345)`. – Rafael Valero Oct 21 '20 at 10:53

As the others already noted, k-means is usually implemented with randomized initialization. It is intentional that you can get different results.

The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result.

In my opinion, when the results vary highly from run to run, this indicates that the data just does not cluster well with k-means at all. Your results are not much better than random in such a case. If the data is really suited for k-means clustering, the results will be rather stable! If they vary, the clusters may not have the same size, or may be not well separated; and other algorithms may yield better results.
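One way to check this stability in practice is to compare the partitions from several seeds with the adjusted Rand index (a sketch, assuming scikit-learn's `adjusted_rand_score`; on well-separated toy blobs the runs agree almost perfectly, which is what a "k-means-friendly" data set looks like):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# well-separated blobs: k-means should be stable across seeds
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

runs = [KMeans(n_clusters=3, n_init=1, random_state=s).fit_predict(X)
        for s in range(5)]
# an adjusted Rand index of 1.0 means two runs produced the same partition
scores = [adjusted_rand_score(runs[0], r) for r in runs[1:]]
print(min(scores))
```

If the same comparison on your own data gives scores near zero, that is the "not much better than random" situation described above.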

Has QUIT--Anony-Mousse
    if I use `n_init=10` and specify the `random_state`, as `n_init=10, random_state=0` , does this make sense? `n_init` is the number of time the k-means algorithm will be run with different centroid seeds. Will the centroids change or not due to the fixed `random_state`?? – seralouk Aug 05 '19 at 13:11

I had a similar issue, but in my case I wanted a data set from another distribution to be clustered the same way as the original data set. For example, all color images of the original data set were in cluster 0 and all gray images were in cluster 1. For another data set, I want color images and gray images to end up in cluster 0 and cluster 1 as well.

Here is the code I stole from a Kaggler - in addition to setting random_state to a seed, you reuse the k-means model returned by KMeans to cluster the other data set. This works reasonably well. However, I can't find official scikit-learn documentation saying that.

# reference - https://www.kaggle.com/kmader/normalizing-brightfield-stained-and-fluorescence
import numpy as np
from sklearn.cluster import KMeans

seed = 42

def create_color_clusters(img_df, cluster_count=2, cluster_maker=None):
    # fit a new model only when none is passed in; otherwise reuse the fitted one
    if cluster_maker is None:
        cluster_maker = KMeans(cluster_count, random_state=seed)
        cluster_maker.fit(img_df[['Green', 'Red-Green', 'Red-Green-Sd']])

    # transform() gives distances to each centroid; argmin picks the nearest cluster
    img_df['cluster-id'] = np.argmin(
        cluster_maker.transform(img_df[['Green', 'Red-Green', 'Red-Green-Sd']]), -1)
    return img_df, cluster_maker

# Now k-means your images `img_df` into two clusters
img_df, cluster_maker = create_color_clusters(img_df, 2)
# Cluster another set of images using the same k-means model
another_img_df, _ = create_color_clusters(another_img_df, 2, cluster_maker)

However, even setting random_state to an int seed cannot ensure the same data will always be grouped in the same order across machines. The same data may be clustered as group 0 on one machine and as group 1 on another. But at least with the same k-means model (cluster_maker in my code) we make sure data from another distribution will be clustered in the same way as the original data set.

Nicole Finnie

Typically when running algorithms with many local minima it's common to take a stochastic approach and run the algorithm many times with different initial states. This will give you multiple results, and the one with the lowest error is usually chosen to be the best result.

When I use K-Means I always run it several times and use the best result.
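That strategy fits in a few lines; in scikit-learn the error to minimize is `inertia_`, the within-cluster sum of squares. A sketch on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=7)

# run k-means from 10 different seeds and keep the lowest-inertia fit
fits = [KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
        for seed in range(10)]
best = min(fits, key=lambda km: km.inertia_)
```

This is essentially what KMeans already does internally when you pass n_init=10.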

mattnedrich