I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.

1 Answer
Here is an answer that traverses an extra step: the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset using the desired object type, in this case a LabeledPoint:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // provides the encoder that .as[LabeledPoint] requires
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]
Update: Ever heard of a SparkSession? (Neither had I until now.) So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (Spark) world order:
Spark 2.0.0+ approaches
Notice that in both of the approaches below (the simpler of which is credited to @zero323) we achieve an important saving compared to the SQLContext approach: it is no longer necessary to first create a DataFrame.
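val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._ // supplies the Encoder[LabeledPoint] that createDataset needs
val pointsTrainDs = sparkSession.createDataset(training)
// pointsTrainDs is already a Dataset[LabeledPoint]; fit is the public spark.ml entry point
val model = new LogisticRegression()
  .fit(pointsTrainDs)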
Second way for Spark 2.0.0+ (credit to @zero323)
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val trainDs = training.toDS()
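For reference, the snippets above can be put together into one self-contained program. This is only a minimal sketch under some assumptions that are not in the original: it targets Spark 2.x, it uses org.apache.spark.ml.feature.LabeledPoint (the spark.ml variant, built on ml.linalg vectors) rather than the older mllib class, and the object name, the local master setting, and the toy data are hypothetical stand-ins for a real training RDD.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical object name, used only for this illustration.
object RddToDatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; a real job would normally get its master from spark-submit.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-dataset-sketch")
      .getOrCreate()

    // Brings the implicit encoders and the RDD-to-Dataset conversions into scope.
    import spark.implicits._

    // Made-up toy data standing in for the real RDD[LabeledPoint].
    val training = spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))
    ))

    // Direct RDD -> Dataset conversion, with no intermediate DataFrame.
    val trainDs: Dataset[LabeledPoint] = training.toDS()

    // The default "label" and "features" column names match the LabeledPoint fields.
    val model = new LogisticRegression().setMaxIter(10).fit(trainDs)
    println(s"Coefficients: ${model.coefficients}")

    spark.stop()
  }
}
```

If the existing RDD holds the older org.apache.spark.mllib.regression.LabeledPoint, the conversion itself is the same, but the spark.ml estimators in 2.x expect ml.linalg vectors, so the points would likely need converting first.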
Traditional Spark 1.x approach (the Dataset API requires 1.6+)
val sqlContext = new SQLContext(sc) // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS()
See also: How to store custom objects in Dataset? by the esteemed @zero323.

- How about `training.toDS`? – zero323 May 29 '16 at 20:04
- @zero323 ah, I see I need to `import sqlContext._`. Updating the answer. – WestCoastProjects May 29 '16 at 20:11
- @zero323 You have added sufficient info, so feel free to add your own answer. – WestCoastProjects May 29 '16 at 21:16
- You're gonna complain there is no fun in answering if I do :D To be serious, I don't have much to add, and a single consistent reference is much better. – zero323 May 30 '16 at 06:49
- @zero323 So you noticed that comment from a week ago. I was concerned you were taking it too seriously ;). It was meant as respect for your knowledge. You also have a very nice approach/attitude. – WestCoastProjects May 30 '16 at 07:05
- Me nice? You must have mistaken me for my good twin :) BTW: it could be a good idea to link http://stackoverflow.com/q/36648128/1560062 here. – zero323 May 30 '16 at 10:58