
I have something like this:

wines = spark.table("dane_nowe_csv")
selected = wines.select("price")

Price is a double. The question is: how can I convert `selected` to a valid type to use with this:

clusters = KMeans.train(selected, 2, maxIterations=10, initializationMode="random")

I've been trying to do this all day. I've searched dozens of topics, and there are always errors. I have a feeling there is some easy way to do this.

Helosze
  • And with `DataFrames` use [`pyspark.ml.clustering.KMeans`](http://spark.apache.org/docs/latest/ml-clustering.html#k-means) not `pyspark.mllib.clustering.KMeans` – Alper t. Turker May 23 '18 at 19:48

1 Answer

from pyspark.mllib.clustering import KMeans

wines = spark.table("dane_nowe_csv")
# Convert the single column to an RDD of feature vectors (lists)
selected = wines.select("price").rdd.map(lambda row: [row["price"]])
clusters = KMeans.train(selected, 2, maxIterations=10, initializationMode="random")

`pyspark.mllib.clustering.KMeans.train` takes an RDD as input, not a DataFrame or a column.

Mugdha