K-means clustering algorithm in pyspark: syntax for defining the initial seed

Question

I am analysing a k-means clustering implementation in pyspark and have a question about the syntax. This is the relevant part of the code:

from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
import numpy as np
kmeans_modeling = KMeans(k = 5, seed = 0)
model = kmeans_modeling.fit(data.select("parameters"))

What does seed = 0 mean? Surely it cannot mean that all the clusters are initialized at the same point, otherwise we wouldn't obtain distinct clusters, right?

The seed here does not refer to any ML-related seed. It refers to the seed for the random number generator. KMeans has randomness associated with it, which means that every time you run the code you may get a slightly different result. If you set the seed to a particular number, the randomness behaves the same way each time and you get the same result on every run. See https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do - Ananda, Dec 23 '20 at 11:41
You can set the initial cluster centers using KMeans (initialModel param) from MLlib instead of ML: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/clustering.html#KMeans.train - chlebek, Dec 23 '20 at 13:00

score 1 · Accepted Answer · answered Dec 23 '20 at 12:07

According to the docs, this seed parameter is indeed a random seed, as suggested in the comments. It makes your machine learning run reproducible: the (pseudo)random number generator gives the same output in every run, provided that the input (including the random seed) is the same.
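To make that concrete, here is a minimal pure-Python sketch of how a seed interacts with random center selection. It is not Spark's implementation: the helper name pick_initial_centers and the sample points are made up for illustration, and it assumes the simplest possible scheme of picking k distinct data points at random.

```python
import random

def pick_initial_centers(points, k, seed):
    """Pick k distinct points as starting centers using a seeded RNG.

    Hypothetical helper for illustration only; Spark's own
    initialization is more involved.
    """
    rng = random.Random(seed)      # the seed configures the generator, not a location
    return rng.sample(points, k)   # k distinct points, chosen pseudo-randomly

# Made-up 2-D points standing in for the feature vectors.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (9.0, 0.0), (9.0, 1.0)]

first_run = pick_initial_centers(points, k=3, seed=0)
second_run = pick_initial_centers(points, k=3, seed=0)

print(first_run == second_run)   # same seed and same input give identical picks
print(len(set(first_run)))       # the picked centers are distinct points
```

So seed = 0 does not place every center at one point. It only fixes which pseudo-random choices are made, so the same starting centers come out on each run.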

If you're looking for cluster initialization options, you can see the docs as well. There are two options: initMode = "random" or initMode = "k-means||", where the latter is the default.
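In pyspark.ml, both the seed and the initialization mode are passed to the estimator as keyword arguments. The following is a sketch under two assumptions: the feature vectors live in the "parameters" column from the question (hence featuresCol, whose default is "features"), and build_kmeans is a hypothetical wrapper, not part of Spark.

```python
VALID_INIT_MODES = ("random", "k-means||")

def build_kmeans(init_mode="k-means||", k=5, seed=0, features_col="parameters"):
    """Return an unfitted pyspark.ml KMeans estimator (hypothetical helper)."""
    if init_mode not in VALID_INIT_MODES:
        raise ValueError(
            f"initMode must be one of {VALID_INIT_MODES}, got {init_mode!r}"
        )
    # Imported here so the check above also works where pyspark is not installed.
    from pyspark.ml.clustering import KMeans
    return KMeans(k=k, seed=seed, initMode=init_mode, featuresCol=features_col)
```

With an active SparkSession and the data DataFrame from the question, build_kmeans("random").fit(data) would fit the model with random initialization. Keeping seed=0 should make repeated runs on the same input reproducible.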
