
I have something like this:

wines = spark.table("dane_nowe_csv")
selected = wines.select("price")

Price is a double. The question is: how can I convert `selected` to a valid type to use with this:

clusters = KMeans.train(selected, 2, maxIterations=10, initializationMode="random")

I've been trying to do this all day. I've searched dozens of topics, and there are always errors. I have a feeling there is some easy way to do this.

Helosze
  • And with `DataFrames` use [`pyspark.ml.clustering.KMeans`](http://spark.apache.org/docs/latest/ml-clustering.html#k-means) not `pyspark.mllib.clustering.KMeans` – Alper t. Turker May 23 '18 at 19:48

1 Answer

from pyspark.mllib.clustering import KMeans

wines = spark.table("dane_nowe_csv")
# Convert the single column to an RDD of feature vectors (lists)
selected = wines.select("price").rdd.map(lambda row: [row["price"]])
clusters = KMeans.train(selected, 2, maxIterations=10, initializationMode="random")

`pyspark.mllib.clustering.KMeans.train` takes an RDD as input, not a DataFrame or a column.

Mugdha