How to create a training file for Spark MLlib Naive Bayes and calculate TF–IDF

Question

I need to classify a large number of products into a category tree, and I am testing Spark MLlib's Naive Bayes for this. However, I don't understand how to calculate the TF-IDF.

I have a training file like this:

#filenameTrainer:
103,355 4 50 60 71 72 66 73 57 53
103,35 45 55 65 75 85 66 73 57 53
104,355 41 51 61 71 72 67 73 58 54

etc.

The first column is the category ID, and the remaining columns are the words, each converted to an index.
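To make the format concrete, here is a minimal sketch in plain Scala (no Spark required) that splits one such line into its category ID and a map from word index to occurrence count. The names `TrainerLine` and `parse` are hypothetical and exist only for illustration. As far as I understand, `MLUtils.loadLabeledPoints` reads the numbers after the comma as a dense vector of feature values, whereas a term-frequency representation would treat each word index as a position that carries a count.

```scala
// Hypothetical helper, not part of Spark: parses one line of the training file.
object TrainerLine {
  // Input:  "label,idx idx idx ..."
  // Output: (label, word index -> number of occurrences in the line)
  def parse(line: String): (Double, Map[Int, Double]) = {
    // Split only on the first comma: left is the label, right is the word list.
    val Array(label, terms) = line.split(",", 2)
    val counts = terms.trim.split("\\s+")
      .map(_.toInt)
      .groupBy(identity)
      .map { case (idx, occurrences) => idx -> occurrences.length.toDouble }
    (label.trim.toDouble, counts)
  }
}
```

For example, `TrainerLine.parse("103,355 4 50 60 71 72 66 73 57 53")` yields the label `103.0` and ten word indices, each with a count of `1.0`.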

This is the (pseudo) code that I use for training:

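import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils

val conf = new SparkConf()
  .setAppName("SparseNaiveBayes test")
  .setMaster("local[1]")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)
// filenameTrainer is the path to the training file shown above
val trainData = MLUtils.loadLabeledPoints(sc, filenameTrainer)
val trained: NaiveBayesModel = NaiveBayes.train(trainData)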

Then, if I try to predict a category:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val testData: Vector = Vectors.dense(Array[Double](3, 35, 45, 55, 65, 75, 85, 66, 73, 92))
val result: Double = trained.predict(testData)
println("Result = " + result)

The result is correct; it returns category 103: Result = 103.0

Now the question is: how can I calculate the TF–IDF for the training file?
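To illustrate the calculation being asked about, here is a minimal sketch in plain Scala rather than Spark. It assumes each row is first turned into word-index counts (the term frequency, TF) and applies the smoothed inverse document frequency that MLlib's `org.apache.spark.mllib.feature.IDF` documents, `idf = log((m + 1) / (df + 1))`, where `m` is the number of rows and `df` is the number of rows containing the word. The object `TfIdfSketch` and its methods are hypothetical names used only for illustration; in Spark itself the equivalent step would be fitting `IDF` on an RDD of term-frequency vectors and calling `transform`.

```scala
// Hypothetical sketch, not part of Spark: TF-IDF over word-index counts.
object TfIdfSketch {
  type TermCounts = Map[Int, Double] // word index -> term frequency

  // Build term frequencies from the word indices of one row.
  def counts(indices: Int*): TermCounts =
    indices.groupBy(identity).map { case (term, hits) => term -> hits.size.toDouble }

  // Number of rows in which each word index occurs at least once.
  def documentFrequencies(docs: Seq[TermCounts]): Map[Int, Int] =
    docs.flatMap(_.keys).groupBy(identity).map { case (term, hits) => term -> hits.size }

  // Smoothed IDF: log((m + 1) / (df + 1)), with m = number of rows.
  def idf(docs: Seq[TermCounts]): Map[Int, Double] = {
    val m = docs.size.toDouble
    documentFrequencies(docs).map { case (term, df) => term -> math.log((m + 1) / (df + 1)) }
  }

  // Multiply each term frequency by the IDF weight of its word index.
  def tfIdf(docs: Seq[TermCounts]): Seq[TermCounts] = {
    val weights = idf(docs)
    docs.map(_.map { case (term, tf) => term -> tf * weights(term) })
  }

  // The three sample rows from the training file, without their labels.
  val sampleDocs: Seq[TermCounts] = Seq(
    counts(355, 4, 50, 60, 71, 72, 66, 73, 57, 53),
    counts(35, 45, 55, 65, 75, 85, 66, 73, 57, 53),
    counts(355, 41, 51, 61, 71, 72, 67, 73, 58, 54)
  )
}
```

With the three sample rows, `TfIdfSketch.tfIdf(TfIdfSketch.sampleDocs)` gives word 73 a weight of 0 because it occurs in every row, while a word that occurs in only one row, such as 4, gets the largest weight. The weighted values, paired again with their category IDs, would then be the features passed to the classifier.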

possible duplicate of [How can I create a TF-IDF for Text Classification using Spark?](http://stackoverflow.com/questions/24548290/how-can-i-create-a-tf-idf-for-text-classification-using-spark) — eliasah, Jul 08 '15 at 15:26
@eliasah Yes, the question is similar. Did you solve it? The answers there did not solve the problem — faster2b, Jul 08 '15 at 15:41
Did you read the answers? The given answers are a good start. I actually solved it but I didn't have time to write a proper explained answer — eliasah, Jul 08 '15 at 17:12

0 Answers