
I am performing a naive Bayes classification in Spark/Scala. It seems to work OK; the code is:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer

val dfLemma2 = dfLemma.withColumn("racist", 'racist.cast("String"))

val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")
val indexed = indexer.fit(dfLemma2).transform(dfLemma2)
indexed.show()


val hashingTF = new HashingTF()
  .setInputCol("lemma").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(indexed)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "indexracist").take(3).foreach(println)
val changedTypedf = rescaledData.withColumn("indexracist", 'indexracist.cast("double"))
changedTypedf.show()

// val labeled = changedTypedf.map(row => LabeledPoint(row(0), row.getAs[Vector](4)))

val labeled = changedTypedf.select("indexracist","features").rdd.map(row => LabeledPoint(
   row.getAs[Double]("indexracist"),
   org.apache.spark.mllib.linalg.Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))
))


import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
// Split data into training (60%) and test (40%).
val Array(training, test) = labeled.randomSplit(Array(0.6, 0.4))

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))    

predictionAndLabel.take(100)

This outputs:

res330: Array[(Double, Double)] = Array((0.0,0.0), (0.0,0.0), (0.0,0.0), (0.0,0.0),

which I assume is an array of (prediction, label) pairs. What I would like to output is these pairs joined to the original text, which was a column called lemma in the training dataframe, so something like:

--------------------------------------------------
| Prediction | Label      | lemma                |
--------------------------------------------------
|    0.0     |    0.0     |[cakes, are, good]    |
|    0.0     |    0.0     |[jim, says, hi]       |
|    1.0     |    1.0     |[shut, the, dam, door]|
...
--------------------------------------------------

Any pointers are appreciated as my Spark/Scala is weak.

EDIT: The text column is called 'lemma' in 'indexed':

+------+-------------------------------------------------------------------------------------------------------------------+
|racist|lemma                                                                                                              |
+------+-------------------------------------------------------------------------------------------------------------------+
|true  |[@cllrwood, abbo, @ukip, britainfirst]                                                                             |
|false |[objectofthemonth, george, lansbury, bust, jussuf, abbo, amp, fascinating, insight, son, jerome]                   |
|false |[nowplay, one, night, stand, van, brave, @bbraveofficial, bbravesquad, abbo, safe]                                 |
|false |[@mahesh, weet, son, satyamurthy, kante, abbo, chana, better, aaamovie]                                            |
+------+-------------------------------------------------------------------------------------------------------------------+
schoon
  • https://stackoverflow.com/questions/7539831/scala-draw-table-to-console - probably has the answer you're looking for. I'm not aware of any Spark/Scala OOTB way to do this. – Sergey Khudyakov Aug 18 '17 at 05:36
  • Thanks Sergey. It is not really the formatting that I care about but the combining of the correct rows/fields from the two datasets. Might be obvious to most Scala programmers but not to me! – schoon Aug 18 '17 at 09:31
  • Can I get a sample of your input? I'll be working on your solution. – Haroun Mohammedi Aug 21 '17 at 19:44

5 Answers


You just need to transform your data and show it, as follows:

val predictions = model.transform(test)
predictions.show()
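
Note that this assumes the DataFrame-based ml API: an mllib NaiveBayesModel has no transform method. A minimal sketch under that assumption, reusing the column names from the question (the exact select list is illustrative):

import org.apache.spark.ml.classification.NaiveBayes

// Sketch: an ml-package transform returns a DataFrame that still carries
// every input column, so "lemma" survives next to the new "prediction" column.
// "training" and "test" are assumed here to be DataFrames with the question's columns.
val nb = new NaiveBayes().setLabelCol("indexracist").setFeaturesCol("features")
val nbModel = nb.fit(training)
nbModel.transform(test).select("prediction", "indexracist", "lemma").show()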
eliasah
  • Thanks Eliasah. I have already done a transform and I can show the predictions, but my question is how to format them nicely with the original dataframe column. – schoon Jul 25 '17 at 05:17
  • I'm not sure I understand what you mean. Would you care to update your question with a minimal complete and verifiable example along with the expected output? – eliasah Jul 25 '17 at 05:30
  • Er, the expected output is the table above. I want to display the pairs from Array((0.0,0.0),... with their corresponding text, e.g. [cakes, are, good]. – schoon Jul 25 '17 at 05:44
  • What are those corresponding text labels? Where are they present? That's why I asked you for a minimal example. – eliasah Jul 25 '17 at 06:20
  • Ah OK, fair enough. I have added the rest of my messy code. The text column is called 'lemma' in 'indexed'. – schoon Jul 25 '17 at 07:14
  • Try IndexToString; it will map your indexed values back to the initial string values (see the sketch below). Refer to http://spark.apache.org/docs/2.2.0/ml-features.html#indextostring – Manikandan Balasubramanian Aug 21 '17 at 05:08
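
A minimal sketch of that IndexToString suggestion, assuming predictions is the DataFrame produced by an ml-package model's transform (as in the answer above) and that indexer and dfLemma2 are as defined in the question; the output column name is made up:

import org.apache.spark.ml.feature.IndexToString

// Sketch: map the numeric "prediction" column back to the original string
// labels learned by the StringIndexer; "labelModel" is an assumed name for
// the fitted StringIndexerModel.
val labelModel = indexer.fit(dfLemma2)
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedRacist")
  .setLabels(labelModel.labels)

converter.transform(predictions).select("predictedRacist", "racist", "lemma").show()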

Try using the ml package instead of the mllib package. For example, refer to https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.sql.SparkSession

object NaiveBayesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("NaiveBayesExample")
      .getOrCreate()


    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Split the data into training and test sets (30% held out for testing)
    val Array(trainingData, testData) = data.randomSplit(Array(0.6, 0.4))

    // Train a NaiveBayes model.
    val model = new NaiveBayes()
      .fit(trainingData)

    // Select example rows to display.
    val predictions = model.transform(testData)
    predictions.show()

    // Select (prediction, true label) and compute test error
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test set accuracy = " + accuracy)


    spark.stop()
  }
}
  • Thanks Mani. What advantage does it have? – schoon Aug 18 '17 at 14:43
  • As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. Refer to the links below: http://spark.apache.org/docs/2.2.0/quick-start.html http://spark.apache.org/docs/2.2.0/ml-guide.html @schoon – Manikandan Balasubramanian Aug 21 '17 at 05:03

As other answers have said, it's recommended that you use the ml package instead of the mllib package since Spark 2.0.

After rewriting your code using the ml package, the answer to your question is very straightforward: just selecting the right columns should meet your needs.

import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}

val dfLemma2 = dfLemma.withColumn("racist", 'racist.cast("String"))

val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")

val hashingTF = new HashingTF()
  .setInputCol("lemma")
  .setOutputCol("rawFeatures")
  .setNumFeatures(20)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val naiveBayes = new NaiveBayes()
  .setLabelCol("indexracist")
  .setFeaturesCol("features")
  .setModelType("multinomial")
  .setSmoothing(1.0)

val pipeline = new Pipeline().setStages(Array(indexer, hashingTF, idf, naiveBayes))

val Array(training, test) = dfLemma2.randomSplit(Array(0.6, 0.4))

val model = pipeline.fit(training)

val predictionAndLabel = model.transform(test).select('prediction, 'racist, 'indexracist, 'lemma)

predictionAndLabel.take(100)

Hope it helps; otherwise, describe your problem in a comment.

Haroun Mohammedi
  • This is fine, but how do you add columns from the data frame that were not part of the classifier's features? Mostly raw info that was used to create the features but is not itself a feature. – Nitin Jul 26 '18 at 12:38

When selecting the output columns to be displayed, include the "lemma" column as well, so that it is written out along with the label and features columns.
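
For example, a minimal sketch, assuming predictions is the DataFrame returned by an ml-package model's transform and the column names from the question:

// Sketch: keep the original text column in the output by selecting it
// explicitly; "prediction", "indexracist" and "lemma" follow the question's
// DataFrame and are assumptions here.
predictions
  .select("prediction", "indexracist", "lemma")
  .show(false)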

For more details, please refer to How to create correct data frame for classification in Spark ML. That post is similar to your question; check whether it helps.

JKC

You have to use a Pipeline to keep the original training columns alongside the prediction columns.
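
A minimal sketch of that idea, reusing the pipeline, training and test names from Haroun Mohammedi's answer above; the point is that a fitted Pipeline's transform appends its prediction columns to the input DataFrame rather than replacing it:

// Sketch: transform keeps the input columns (racist, lemma, ...) and adds
// rawPrediction, probability and prediction, so the original training
// columns are still there to select.
val fittedModel = pipeline.fit(training)
fittedModel.transform(test)
  .select("prediction", "indexracist", "lemma")
  .show(false)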

openbrace