How to prepare data to apply RandomForest?

Question

I have a CSV file that contains userId, MovieId, and Rating. I want to convert this file into one containing label and features, as in "How to prepare data into a LibSVM format from DataFrame?".

I need to separate the Rating column into its own file and determine the LabeledPoint label from it. To apply the random forest algorithm I need a label column in the file, but it doesn't exist.

import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Fit PCA on the assembled feature vectors and keep 3 principal components.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(assembled_df)

val pcaTrainingData = pca.transform(assembled_df).select("id", "features", "pcaFeatures")

// Build LabeledPoints from the "label" and "pcaFeatures" columns.
val labeled = pca.transform(assembled_df).rdd.map(row => LabeledPoint(
  row.getAs[Double]("label"),
  row.getAs[org.apache.spark.mllib.linalg.Vector]("pcaFeatures")
))

val numClasses = 10
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 20
val maxBins = 32

val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

How can I create the label column?
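
One possible way to get a label column is sketched below. It assumes that the CSV has a header row with the columns userId, MovieId and Rating, and that the rating is the value to predict. The file path ratings.csv and the helper name withLabelAndFeatures are hypothetical and used only for illustration.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PrepareLabelAndFeatures {

  // Hypothetical helper: copies Rating into a Double "label" column and
  // packs userId and MovieId into a single "features" vector column.
  def withLabelAndFeatures(ratings: DataFrame): DataFrame = {
    val withLabel = ratings.withColumn("label", col("Rating").cast("double"))

    val assembler = new VectorAssembler()
      .setInputCols(Array("userId", "MovieId"))
      .setOutputCol("features")

    assembler.transform(withLabel).select("label", "features")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("prepare-label-features")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Assumption: "ratings.csv" is a placeholder path with a header row.
    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("ratings.csv")

    val assembled_df = withLabelAndFeatures(ratings)
    assembled_df.show(5, truncate = false)

    spark.stop()
  }
}
```

The resulting DataFrame has exactly the two columns label and features, so later stages that look up "label" can find it.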

Not sure what file you mean, or whether the code above produces an error. — ELinda, Feb 06 '20 at 06:31
I don't have a label column in the file. Can I apply random forest without a label column? The error is: Field "label" does not exist. — Remaz, Feb 06 '20 at 14:25
This means that `pca.transform(assembled_df)` has no column called "label". Perhaps you can use `withColumn` to add it if it's based on existing columns, or just change your `map` call (row => LabeledPoint...) to reference only existing columns. — ELinda, Feb 07 '20 at 00:35
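
The last suggestion could look roughly like the sketch below, once a label column exists. It assumes Spark 2.x or later, where `org.apache.spark.ml.feature.PCA` outputs `org.apache.spark.ml.linalg.Vector` values, so they are converted with `Vectors.fromML` before building an mllib `LabeledPoint`. The toy DataFrame and the helper name toLabeledPoint are hypothetical stand-ins for the real PCA output.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

object ToLabeledPoints {

  // Hypothetical helper: converts an ml vector to an mllib vector
  // and pairs it with its label.
  def toLabeledPoint(label: Double, features: MLVector): LabeledPoint =
    LabeledPoint(label, OldVectors.fromML(features))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("to-labeled-points")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Toy stand-in for pca.transform(...) output that already has a "label" column.
    val df = spark.createDataFrame(Seq(
      (4.0, MLVectors.dense(1.0, 10.0, 0.5)),
      (2.0, MLVectors.dense(2.0, 20.0, 0.1))
    )).toDF("label", "pcaFeatures")

    // Reference only columns that actually exist in the DataFrame.
    val labeled = df.rdd.map { row =>
      toLabeledPoint(
        row.getAs[Double]("label"),
        row.getAs[MLVector]("pcaFeatures")
      )
    }

    labeled.collect().foreach(println)
    spark.stop()
  }
}
```

Note that `RandomForest.trainClassifier` expects labels in the range 0 to numClasses - 1, so rating values may need to be mapped to class indices first. If the rating is treated as a continuous value, `RandomForest.trainRegressor` may be a better fit.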
